Author: Anand Bose

Feature Engineering Techniques

Data Description: The actual concrete compressive strength (MPa) for a given mixture under a specific age (days) was determined from laboratory. Data is in raw form (not scaled).The data has 8 quantitative input variables, and 1 quantitative output variable, and 1030 instances (observations).

Domain: Material manufacturing

Context: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

Attribute Information slag ash water superplastic coarseagg fineagg age strength

  • cement: measured in kg in a m3 mixture
  • slag: measured in kg in a m3 mixture
  • ash: measured in kg in a m3 mixture
  • water: measured in kg in a m3 mixture
  • superplastic: measured in kg in a m3 mixture
  • coarseagg: measured in kg in a m3 mixture
  • fineagg: measured in kg in a m3 mixture
  • age: day (1~365)
  • strength: Concrete compressive strength measured in MPa

Learning Outcomes

  • Exploratory Data Analysis
  • Building ML models for regression
  • Hyper parameter tuning
In [3]:
!pip install catboost
!pip install eli5
!pip install hyperopt
Requirement already satisfied: catboost in /Users/anandbose/anaconda3/lib/python3.8/site-packages (0.24.3)
Requirement already satisfied: scipy in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from catboost) (1.5.0)
Requirement already satisfied: matplotlib in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from catboost) (3.2.2)
Requirement already satisfied: numpy>=1.16.0 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from catboost) (1.18.5)
Requirement already satisfied: plotly in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from catboost) (4.13.0)
Requirement already satisfied: pandas>=0.24.0 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from catboost) (1.0.5)
Requirement already satisfied: graphviz in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from catboost) (0.15)
Requirement already satisfied: six in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from catboost) (1.15.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from matplotlib->catboost) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from matplotlib->catboost) (2.8.1)
Requirement already satisfied: retrying>=1.3.3 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: pytz>=2017.2 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from pandas>=0.24.0->catboost) (2020.1)
Requirement already satisfied: eli5 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (0.10.1)
Requirement already satisfied: numpy>=1.9.0 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from eli5) (1.18.5)
Requirement already satisfied: scipy in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from eli5) (1.5.0)
Requirement already satisfied: six in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from eli5) (1.15.0)
Requirement already satisfied: tabulate>=0.7.7 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from eli5) (0.8.7)
Requirement already satisfied: scikit-learn>=0.18 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from eli5) (0.23.1)
Requirement already satisfied: graphviz in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from eli5) (0.15)
Requirement already satisfied: attrs>16.0.0 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from eli5) (19.3.0)
Requirement already satisfied: jinja2 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from eli5) (2.11.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from scikit-learn>=0.18->eli5) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from scikit-learn>=0.18->eli5) (0.16.0)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from jinja2->eli5) (1.1.1)
Collecting hyperopt
  Downloading hyperopt-0.2.5-py2.py3-none-any.whl (965 kB)
     |████████████████████████████████| 965 kB 4.8 MB/s eta 0:00:01     |█████████████████▎              | 522 kB 4.8 MB/s eta 0:00:01
Requirement already satisfied: future in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from hyperopt) (0.18.2)
Requirement already satisfied: networkx>=2.2 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from hyperopt) (2.4)
Requirement already satisfied: scipy in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.5.0)
Requirement already satisfied: tqdm in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from hyperopt) (4.47.0)
Requirement already satisfied: numpy in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.18.5)
Requirement already satisfied: cloudpickle in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.5.0)
Requirement already satisfied: six in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from hyperopt) (1.15.0)
Requirement already satisfied: decorator>=4.3.0 in /Users/anandbose/anaconda3/lib/python3.8/site-packages (from networkx>=2.2->hyperopt) (4.4.2)
Installing collected packages: hyperopt
Successfully installed hyperopt-0.2.5

Import Packages

In [4]:
# Imports
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats; from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
from collections import OrderedDict
%matplotlib inline

# Checking Leverage and Influence Points
from statsmodels.graphics.regressionplots import *
import statsmodels.stats.stattools as stools
import statsmodels.formula.api as smf
import statsmodels.stats as stats
import scipy.stats as scipystats
import statsmodels.api as sm

# Checking multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
from patsy import dmatrices

# Cluster analysis
from sklearn.cluster import KMeans

# Feature importance
import eli5
from eli5.sklearn import PermutationImportance

# Modelling
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, ExtraTreesRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split, KFold, cross_val_score, learning_curve
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor, Pool
from sklearn.svm import SVR
import xgboost as xgb

# Metrics
from sklearn.metrics import make_scorer, mean_squared_error, r2_score

# Hyperparameter tuning
from hyperopt import hp, fmin, tpe, STATUS_OK, STATUS_FAIL, Trials, space_eval
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.utils import resample

# Display settings
pd.options.display.max_rows = 400
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format

random_state = 2019
np.random.seed(random_state)

# Suppress warnings
import warnings; warnings.filterwarnings('ignore')

Read the dataset and check first five rows

In [5]:
# Reading the data as dataframe and print the first five rows
concrete = pd.read_csv('concrete.csv')
concrete.head()
Out[5]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.30 212.00 0.00 203.50 0.00 971.80 748.50 28 29.89
1 168.90 42.20 124.30 158.30 10.80 1080.80 796.20 14 23.51
2 250.00 0.00 95.70 187.40 5.50 956.90 861.20 28 29.22
3 266.00 114.00 0.00 228.00 0.00 932.00 670.00 28 45.85
4 154.80 183.40 0.00 193.30 9.10 1047.40 696.70 28 18.29

Exploratory Data Analysis

Performing exploratory data analysis on the cement dataset. Below are some of the steps performed:

  • Univariate analysis – explore data types and description of the independent attributes including name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions / tails, missing values, outliers
  • Bi-variate analysis between the predictor variables and between the predictor variables and target column. Comment findings in terms of their relationship and degree of relation if any. Visualize the analysis using boxplots and pair plots, histograms or density curves.
In [6]:
print('Several helper function that was created to help in EDA and Modelling'); print('--'*60)

# Customized describe function
def custom_describe(df):
  results = []
  for col in df.select_dtypes(include = ['float64', 'int64']).columns.tolist():
    stats = OrderedDict({'': col, 'Count': df[col].count(), 'Type': df[col].dtype, 'Mean': round(df[col].mean(), 2), 'StandardDeviation': round(df[col].std(), 2), 
                         'Variance': round(df[col].var(), 2), 'Minimum': round(df[col].min(), 2), 'Q1': round(df[col].quantile(0.25), 2), 
                         'Median': round(df[col].median(), 2), 'Q3': round(df[col].quantile(0.75), 2), 'Maximum': round(df[col].max(), 2),
                         'Range': round(df[col].max(), 2)-round(df[col].min(), 2), 'IQR': round(df[col].quantile(0.75), 2)-round(df[col].quantile(0.25), 2),
                         'Kurtosis': round(df[col].kurt(), 2), 'Skewness': round(df[col].skew(), 2), 'MeanAbsoluteDeviation': round(df[col].mad(), 2)})
    if df[col].skew() < -1:
      if df[col].median() < df[col].mean():
        ske = 'Highly Skewed (Right)'
      else:
        ske = 'Highly Skewed (Left)'
    elif -1 <= df[col].skew() <= -0.5:
      if df[col].median() < df[col].mean():
        ske = 'Moderately Skewed (Right)'
      else:
        ske = 'Moderately Skewed (Left)'
    elif -0.5 < df[col].skew() <= 0:
      if df[col].median() < df[col].mean():
        ske = 'Fairly Symmetrical (Right)'
      else:
        ske = 'Fairly Symmetrical (Left)'
    elif 0 < df[col].skew() <= 0.5:
      if df[col].median() < df[col].mean():
        ske = 'Fairly Symmetrical (Right)'
      else:
        ske = 'Fairly Symmetrical (Left)'
    elif 0.5 < df[col].skew() <= 1:
      if df[col].median() < df[col].mean():
        ske = 'Moderately Skewed (Right)'
      else:
        ske = 'Moderately Skewed (Left)'
    elif df[col].skew() > 1:
      if df[col].median() < df[col].mean():
        ske = 'Highly Skewed (Right)'
      else:
        ske = 'Highly Skewed (Left)'
    else:
      ske = 'Error'
    stats['SkewnessComment'] = ske
    upper_lim, lower_lim = stats['Q3'] + (1.5 * stats['IQR']), stats['Q1'] - (1.5 * stats['IQR'])
    if len([x for x in df[col] if x < lower_lim or x > upper_lim])>1:
      out = 'HasOutliers'
    else:
      out = 'NoOutliers'
    stats['OutliersComment'] = out
    results.append(stats)
  statistics = pd.DataFrame(results).set_index('')

  return display(statistics)

# Functions that will help us with EDA plot
def odp_plots(df, col):
    f,(ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (15, 7.2))
    
    # Boxplot to check outliers
    sns.boxplot(x = col, data = df, ax = ax1, orient = 'v', color = 'darkslategrey')
    
    # Distribution plot with outliers
    sns.distplot(df[col], ax = ax2, color = 'teal', fit = norm, rug = True).set_title(f'{col} with outliers')
    ax2.axvline(df[col].mean(), color = 'r', linestyle = '--', label = 'Mean', linewidth = 1.2)
    ax2.axvline(df[col].median(), color = 'g', linestyle = '--', label = 'Median', linewidth = 1.2)
    ax2.axvline(df[col].mode()[0], color = 'b', linestyle = '--', label = 'Mode', linewidth = 1.2); ax2.legend(loc = 'best')
    
    # Removing outliers, but in a new dataframe
    upperbound, lowerbound = np.percentile(df[col], [1, 99])
    y = pd.DataFrame(np.clip(df[col], upperbound, lowerbound))
    
    # Distribution plot without outliers
    sns.distplot(y[col], ax = ax3, color = 'tab:orange', fit = norm, rug = True).set_title(f'{col} without outliers')
    ax3.axvline(y[col].mean(), color = 'r', linestyle = '--', label = 'Mean', linewidth = 1.2)
    ax3.axvline(y[col].median(), color = 'g', linestyle = '--', label = 'Median', linewidth = 1.2)
    ax3.axvline(y[col].mode()[0], color = 'b', linestyle = '--', label = 'Mode', linewidth = 1.2); ax3.legend(loc = 'best')
    
    kwargs = {'fontsize':14, 'color':'black'}
    ax1.set_title(col + ' Boxplot Analysis', **kwargs)
    ax1.set_xlabel('Box', **kwargs)
    ax1.set_ylabel(col + ' Values', **kwargs)

    return plt.show()

# Correlation matrix for all variables
def correlation_matrix(df, threshold = 0.8):
    corr = df.corr()
    mask = np.zeros_like(corr, dtype = np.bool)
    mask[np.triu_indices_from(mask)] = True
    f, ax = plt.subplots(figsize = (15, 7.2))
    cmap = sns.diverging_palette(220, 10, as_cmap = True)
    sns.heatmap(corr, mask = mask, cmap = cmap, square = True, linewidths = .5, cbar_kws = {"shrink": .5})#, annot = True)
    ax.set_title('Correlation Matrix of Data')

    # Filter for correlation value greater than threshold
    sort = corr.abs().unstack()
    sort = sort.sort_values(kind = "quicksort", ascending = False)
    display(sort[(sort > threshold) & (sort < 1)])
    
# Outliers removal
def outliers(df, col, method = 'quantile', strategy = 'median', drop = True):
    if method == 'quantile':
        Q3, Q2, Q1 = df[col].quantile([0.75, 0.50, 0.25])
        IQR = Q3 - Q1
        upper_lim = Q3 + (1.5 * IQR)
        lower_lim = Q1 - (1.5 * IQR)
        print(f'Outliers for {col} are: {sorted([x for x in df[col] if x < lower_lim or x > upper_lim])}\n')
        if strategy == 'median':
            df.loc[(df[col] < lower_lim) | (df[col] > upper_lim), col] = Q2
        else:
            df.loc[(df[col] < lower_lim) | (df[col] > upper_lim), col] = df[col].mean()
    elif method == 'stddev':
        col_mean, col_std, Q2 = df[col].mean(), df[col].std(), df[col].median()
        cut_off = col_std * 3
        lower_lim, upper_lim = col_mean - cut_off, col_mean + cut_off
        print(f'Outliers for {col} are: {sorted([x for x in df[col] if x < lower_lim or x > upper_lim])}\n')
        if strategy == 'median':
            df.loc[(df[col] < lower_lim) | (df[col] > upper_lim), col] = Q2
        else:
            df.loc[(df[col] < lower_lim) | (df[col] > upper_lim), col] = col_mean
    else:
      print('Please pass the correct method, strategy or drop criteria')

# KMeans Plots
def kmeans_plots(df, compcol):
  columns = list(set(list(df.columns))-set([compcol]))
  f, ax = plt.subplots(4, 2, figsize = (15, 15))
  ax[0][0].scatter(X[compcol], X[columns[0]], c = labels, s = 25, cmap = 'viridis'); ax[0][0].set_xlabel(compcol); ax[0][0].set_ylabel(columns[0])
  ax[0][1].scatter(X[compcol], X[columns[1]], c = labels, s = 25, cmap = 'viridis'); ax[0][1].set_xlabel(compcol); ax[0][1].set_ylabel(columns[1])
  ax[1][0].scatter(X[compcol], X[columns[2]], c = labels, s = 25, cmap = 'viridis'); ax[1][0].set_xlabel(compcol); ax[1][0].set_ylabel(columns[2])
  ax[1][1].scatter(X[compcol], X[columns[3]], c = labels, s = 25, cmap = 'viridis'); ax[1][1].set_xlabel(compcol); ax[1][1].set_ylabel(columns[3])
  ax[2][0].scatter(X[compcol], X[columns[4]], c = labels, s = 25, cmap = 'viridis'); ax[2][0].set_xlabel(compcol); ax[2][0].set_ylabel(columns[4])
  ax[2][1].scatter(X[compcol], X[columns[5]], c = labels, s = 25, cmap = 'viridis'); ax[2][1].set_xlabel(compcol); ax[2][1].set_ylabel(columns[5])
  ax[3][0].scatter(X[compcol], X[columns[6]], c = labels, s = 25, cmap = 'viridis'); ax[3][0].set_xlabel(compcol); ax[3][0].set_ylabel(columns[5])

# For rmse scoring
def rmse_score(y, y_pred):
    return np.sqrt(np.mean((y_pred - y)**2))

# Function to get top results from grid search and randomized search
def report(results):
    df = pd.concat([pd.DataFrame(results.cv_results_['params']), pd.DataFrame(results.cv_results_['mean_test_score'], columns = ['r2'])], axis = 1)
    return df
Several helper function that was created to help in EDA and Modelling
------------------------------------------------------------------------------------------------------------------------

Univariate analysis

In [7]:
# Get info of the dataframe columns
concrete.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cement        1030 non-null   float64
 1   slag          1030 non-null   float64
 2   ash           1030 non-null   float64
 3   water         1030 non-null   float64
 4   superplastic  1030 non-null   float64
 5   coarseagg     1030 non-null   float64
 6   fineagg       1030 non-null   float64
 7   age           1030 non-null   int64  
 8   strength      1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
In [8]:
concrete.isnull().sum()
Out[8]:
cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64

Observation 1 - Dataset shape

Dataset has 1030 rows and 9 columns, with no missing values.

Observation 2 - Information on the type of variable

All features are of numerical types. strength is a target variable (continuous). age is a discrete feature whereas rest of them are continuous.

In [9]:
### Five point summary of numerical attributes and check unique values in 'object' columns
print('Five point summary of the dataframe'); print('--'*60)

custom_describe(concrete)
Five point summary of the dataframe
------------------------------------------------------------------------------------------------------------------------
Count Type Mean StandardDeviation Variance Minimum Q1 Median Q3 Maximum Range IQR Kurtosis Skewness MeanAbsoluteDeviation SkewnessComment OutliersComment
cement 1030 float64 281.17 104.51 10921.58 102.00 192.38 272.90 350.00 540.00 438.00 157.62 -0.52 0.51 86.78 Moderately Skewed (Right) NoOutliers
slag 1030 float64 73.90 86.28 7444.12 0.00 0.00 22.00 142.95 359.40 359.40 142.95 -0.51 0.80 76.93 Moderately Skewed (Right) HasOutliers
ash 1030 float64 54.19 64.00 4095.62 0.00 0.00 0.00 118.30 200.10 200.10 118.30 -1.33 0.54 60.42 Moderately Skewed (Right) NoOutliers
water 1030 float64 181.57 21.35 456.00 121.80 164.90 185.00 192.00 247.00 125.20 27.10 0.12 0.07 16.92 Fairly Symmetrical (Left) HasOutliers
superplastic 1030 float64 6.20 5.97 35.69 0.00 0.00 6.40 10.20 32.20 32.20 10.20 1.41 0.91 4.92 Moderately Skewed (Left) HasOutliers
coarseagg 1030 float64 972.92 77.75 6045.68 801.00 932.00 968.00 1029.40 1145.00 344.00 97.40 -0.60 -0.04 62.80 Fairly Symmetrical (Right) NoOutliers
fineagg 1030 float64 773.58 80.18 6428.19 594.00 730.95 779.50 824.00 992.60 398.60 93.05 -0.10 -0.25 61.88 Fairly Symmetrical (Left) HasOutliers
age 1030 int64 45.66 63.17 3990.44 1.00 7.00 28.00 56.00 365.00 364.00 49.00 12.17 3.27 39.12 Highly Skewed (Right) HasOutliers
strength 1030 float64 35.82 16.71 279.08 2.33 23.71 34.45 46.14 82.60 80.27 22.43 -0.31 0.42 13.46 Fairly Symmetrical (Right) HasOutliers

Observation 3 - Descriptive statistics

  • cement: Data ranges between 102 to 540, while 25th and 75th percentile is spread between 192.38 to 350. Median (272.90) is less than Mean (281.17) which means cement is moderately skewed to the right. Column has no outliers.
  • slag: Data ranges between 0 to 359.40, while 25th and 75th percentile is spread between 0 to 142.95. Median (22) is less than Mean (73.90) which means slag is moderately skewed to the right. Column has outliers.
  • ash: Data ranges between 0 to 200.10, while 25th and 75th percentile is spread between 0 to 118.30. Median (0) is less than Mean (54.19) which mean ash is moderately skewed to the right. Column has no outliers.
  • water: Data ranges between 121.80 to 247, while 25th and 75th percentile is spread between 164.90 and 192. Median (185) is greater than Mean (181.57) which means water is skewed to the left (fairly sym). Column has outliers.
  • superplastic: Data ranges between 0 to 32.20, while 25th and 75th percentile is spread between 0 to 10.20. Median (6.40) is greater than Mean (6.20) which means superplastic is moderately skewed to the left. Column has outliers.
  • coarseagg: Data ranges between 801 to 1145, while 25th and 75th percentile is spread between 932 to 1029.40. Median (968) is less than Mean (972.92) which means coarseagg is skewed to the right (fairly sym). Column has no outliers.
  • fineagg: Data ranges between 594 to 992.60, while 25th and 75th percentile is spread between 730.95 to 824. Median (779.5) is greater than Mean (773.58) which means fineagg is skewed to the left (fairly sym). Column has no outliers.
  • age: Data ranges between 1 to 365, while 25th and 75th percentile is spread between 7 to 56. Median (28) is less than Mean (45.66) which means age is highly skewed to the right. Column has no outliers.
  • strength: Data ranges between 2.33 to 82.60, while 25th and 75th percentile is spread between 23.71 to 46.14. Median (34.45) is less than Mean (35.82) which means strength is slightly skewed to the right (fairly sym). Column has no outliers.
In [10]:
# A quick check to find columns that contain outliers
print('A quick check to find columns that contain outliers, graphical'); print('--'*60)

fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = concrete.iloc[:, 0:-1], orient = 'h')
A quick check to find columns that contain outliers, graphical
------------------------------------------------------------------------------------------------------------------------
In [11]:
# Outlier, distribution for columns with outliers
print('Box plot, distribution of columns with and without outliers'); print('--'*60)

boxplotcolumns = list(concrete.columns)[:-1]
for cols in boxplotcolumns:
    Q3 = concrete[cols].quantile(0.75)
    Q1 = concrete[cols].quantile(0.25)
    IQR = Q3 - Q1

    print(f'{cols.capitalize()} column', '--'*40)
    print(f'Number of rows with outliers: {len(concrete.loc[(concrete[cols] < (Q1 - 1.5 * IQR)) | (concrete[cols] > (Q3 + 1.5 * IQR))])}')
    display(concrete.loc[(concrete[cols] < (Q1 - 1.5 * IQR)) | (concrete[cols] > (Q3 + 1.5 * IQR))].head())
    odp_plots(concrete, cols)

del cols, IQR, boxplotcolumns
Box plot, distribution of columns with and without outliers
------------------------------------------------------------------------------------------------------------------------
Cement column --------------------------------------------------------------------------------
Number of rows with outliers: 0
cement slag ash water superplastic coarseagg fineagg age strength
Slag column --------------------------------------------------------------------------------
Number of rows with outliers: 2
cement slag ash water superplastic coarseagg fineagg age strength
918 239.60 359.40 0.00 185.70 0.00 941.60 664.30 28 39.44
990 239.60 359.40 0.00 185.70 0.00 941.60 664.30 7 25.42
Ash column --------------------------------------------------------------------------------
Number of rows with outliers: 0
cement slag ash water superplastic coarseagg fineagg age strength
Water column --------------------------------------------------------------------------------
Number of rows with outliers: 9
cement slag ash water superplastic coarseagg fineagg age strength
66 237.00 92.00 71.00 247.00 6.00 853.00 695.00 28 28.63
263 236.90 91.70 71.50 246.90 6.00 852.90 695.40 28 28.63
432 168.00 42.10 163.80 121.80 5.70 1058.70 780.10 28 24.24
462 168.00 42.10 163.80 121.80 5.70 1058.70 780.10 100 39.23
587 168.00 42.10 163.80 121.80 5.70 1058.70 780.10 3 7.75
Superplastic column --------------------------------------------------------------------------------
Number of rows with outliers: 10
cement slag ash water superplastic coarseagg fineagg age strength
44 531.30 0.00 0.00 141.80 28.20 852.10 893.70 91 59.20
156 531.30 0.00 0.00 141.80 28.20 852.10 893.70 28 56.40
232 469.00 117.20 0.00 137.80 32.20 852.10 840.50 56 69.30
292 469.00 117.20 0.00 137.80 32.20 852.10 840.50 91 70.70
538 531.30 0.00 0.00 141.80 28.20 852.10 893.70 7 46.90
Coarseagg column --------------------------------------------------------------------------------
Number of rows with outliers: 0
cement slag ash water superplastic coarseagg fineagg age strength
Fineagg column --------------------------------------------------------------------------------
Number of rows with outliers: 5
cement slag ash water superplastic coarseagg fineagg age strength
129 375.00 93.80 0.00 126.60 23.40 852.10 992.60 91 62.50
447 375.00 93.80 0.00 126.60 23.40 852.10 992.60 7 45.70
504 375.00 93.80 0.00 126.60 23.40 852.10 992.60 3 29.00
584 375.00 93.80 0.00 126.60 23.40 852.10 992.60 56 60.20
857 375.00 93.80 0.00 126.60 23.40 852.10 992.60 28 56.70
Age column --------------------------------------------------------------------------------
Number of rows with outliers: 59
cement slag ash water superplastic coarseagg fineagg age strength
51 331.00 0.00 0.00 192.00 0.00 978.00 825.00 180 39.00
64 332.50 142.50 0.00 228.00 0.00 932.00 594.00 365 41.05
93 427.50 47.50 0.00 228.00 0.00 932.00 594.00 180 41.84
99 237.50 237.50 0.00 228.00 0.00 932.00 594.00 180 36.25
103 380.00 0.00 0.00 228.00 0.00 932.00 670.00 180 53.10
In [12]:
# Replacing outliers with mean values in these columns
print('Replacing outliers with mean values using quantile method'); print('--'*60)

concrete_im = concrete.copy(deep = True)
outliers_cols = ['slag', 'water', 'superplastic', 'fineagg', 'age']

for col in outliers_cols:
    outliers(concrete_im, col, method = 'quantile', strategy = 'mean')

print('\nColumn for which outliers where replaced with mean using quantile method: \n', outliers_cols)
Replacing outliers with mean values using quantile method
------------------------------------------------------------------------------------------------------------------------
Outliers for slag are: [359.4, 359.4]

Outliers for water are: [121.8, 121.8, 121.8, 121.8, 121.8, 236.7, 237.0, 246.9, 247.0]

Outliers for superplastic are: [28.2, 28.2, 28.2, 28.2, 28.2, 32.2, 32.2, 32.2, 32.2, 32.2]

Outliers for fineagg are: [992.6, 992.6, 992.6, 992.6, 992.6]

Outliers for age are: [180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 360, 360, 360, 360, 360, 360, 365, 365, 365, 365, 365, 365, 365, 365, 365, 365, 365, 365, 365, 365]


Column for which outliers where replaced with mean using quantile method: 
 ['slag', 'water', 'superplastic', 'fineagg', 'age']
In [13]:
print('Summary stats before outlier removal for columns with outliers'); print('--'*60); display(concrete[outliers_cols].describe().T)
print('\nSummary stats after outlier removal for columns with outliers'); print('--'*60); display(concrete_im[outliers_cols].describe().T)
Summary stats before outlier removal for columns with outliers
------------------------------------------------------------------------------------------------------------------------
count mean std min 25% 50% 75% max
slag 1030.00 73.90 86.28 0.00 0.00 22.00 142.95 359.40
water 1030.00 181.57 21.35 121.80 164.90 185.00 192.00 247.00
superplastic 1030.00 6.20 5.97 0.00 0.00 6.40 10.20 32.20
fineagg 1030.00 773.58 80.18 594.00 730.95 779.50 824.00 992.60
age 1030.00 45.66 63.17 1.00 7.00 28.00 56.00 365.00
Summary stats after outlier removal for columns with outliers
------------------------------------------------------------------------------------------------------------------------
count mean std min 25% 50% 75% max
slag 1030.00 73.34 85.35 0.00 0.00 22.00 142.73 342.10
water 1030.00 181.62 20.60 126.60 164.90 185.00 192.00 228.00
superplastic 1030.00 5.97 5.48 0.00 0.00 6.20 10.07 23.40
fineagg 1030.00 772.52 78.70 594.00 730.95 778.90 822.20 945.00
age 1030.00 33.27 27.95 1.00 7.00 28.00 45.66 120.00

Observation 4 - After imputation

A quick observation after imputating the missing values: medians remain unchanged while mean changes slightly not significantly. Type of skewness remain unchanged.

In [14]:
# A quick check to find columns that contain outliers
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = concrete_im.iloc[:, 0:-1], orient = 'h')
In [15]:
print('cement and strength column have a linear relationship'); print('--'*60)
sns.pairplot(concrete_im, diag_kind = 'kde')
cement and strength column have a linear relationship
------------------------------------------------------------------------------------------------------------------------
Out[15]:
<seaborn.axisgrid.PairGrid at 0x7fb58194ccd0>

Observation 5 - Pairplot comments

  • Cement and strength have a linear relationship.
  • Column that have bi/multimodal distributions are slag, ash and superplastic.

Multivariate analysis

In [16]:
for col in list(concrete_im.columns)[:-2]:
    fig, ax1 = plt.subplots(figsize = (15, 7.2), ncols = 1, sharex = False)
    sns.regplot(x = concrete_im[col], y = concrete_im['strength'], ax = ax1).set_title(f'Understanding relation between {col}, strength')

Leverage Analysis

Reference for carrying out this analysis

Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. These leverage points can have an effect on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

In [17]:
lm = smf.ols(formula = 'strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age', data = concrete_im).fit()
print(lm.summary())

influence = lm.get_influence()
resid_student = influence.resid_studentized_external
(cooks, p) = influence.cooks_distance
(dffits, p) = influence.dffits
leverage = influence.hat_matrix_diag

print('\n')
print('Leverage v.s. Studentized Residuals')
fig = plt.figure(figsize = (15, 7.2))
sns.regplot(leverage, lm.resid_pearson,  fit_reg = False)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               strength   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.739
Method:                 Least Squares   F-statistic:                     365.7
Date:                Sat, 05 Dec 2020   Prob (F-statistic):          1.53e-293
Time:                        01:31:41   Log-Likelihood:                -3664.9
No. Observations:                1030   AIC:                             7348.
Df Residuals:                    1021   BIC:                             7392.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       32.1465     18.667      1.722      0.085      -4.484      68.777
cement           0.1026      0.006     16.713      0.000       0.091       0.115
slag             0.0751      0.007     10.255      0.000       0.061       0.089
ash              0.0442      0.009      4.880      0.000       0.026       0.062
water           -0.1785      0.030     -5.935      0.000      -0.238      -0.119
superplastic     0.2660      0.085      3.140      0.002       0.100       0.432
coarseagg       -0.0037      0.007     -0.571      0.568      -0.017       0.009
fineagg         -0.0120      0.008     -1.569      0.117      -0.027       0.003
age              0.3200      0.010     33.513      0.000       0.301       0.339
==============================================================================
Omnibus:                       31.300   Durbin-Watson:                   1.944
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.393
Skew:                           0.358   Prob(JB):                     7.59e-09
Kurtosis:                       3.600   Cond. No.                     9.07e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.07e+04. This might indicate that there are
strong multicollinearity or other numerical problems.


Leverage v.s. Studentized Residuals
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb585b699a0>
In [19]:
concrete_im_res = pd.concat([pd.Series(cooks, name = 'cooks'), pd.Series(dffits, name = 'dffits'), pd.Series(leverage, name = 'leverage'), pd.Series(resid_student, name = 'resid_student')], axis = 1)
concrete_im_res = pd.concat([concrete_im, concrete_im_res], axis = 1)
concrete_im_res.head()
Out[19]:
cement slag ash water superplastic coarseagg fineagg age strength cooks dffits leverage resid_student
0 141.30 212.00 0.00 203.50 0.00 971.80 748.50 28.00 29.89 0.00 0.07 0.01 0.86
1 168.90 42.20 124.30 158.30 10.80 1080.80 796.20 14.00 23.51 0.00 -0.00 0.01 -0.02
2 250.00 0.00 95.70 187.40 5.50 956.90 861.20 28.00 29.22 0.00 0.03 0.00 0.48
3 266.00 114.00 0.00 228.00 0.00 932.00 670.00 28.00 45.85 0.00 0.21 0.01 2.49
4 154.80 183.40 0.00 193.30 9.10 1047.40 696.70 28.00 18.29 0.00 -0.10 0.01 -0.96
In [20]:
# Studentized Residual
print('Studentized residuals as a first means for identifying outliers'); print('--'*60)
r = concrete_im_res.resid_student
print('-'*30 + ' studentized residual ' + '-'*30)
display(r.describe())
print('\n')

r_sort = concrete_im_res.sort_values(by = 'resid_student', ascending = True)
print('-'*30 + ' top 5 most negative residuals ' + '-'*30)
display(r_sort.head())
print('\n')

r_sort = concrete_im_res.sort_values(by = 'resid_student', ascending = False)
print('-'*30 + ' top 5 most positive residuals ' + '-'*30)
display(r_sort.head())
Studentized residuals as a first means for identifying outliers
------------------------------------------------------------------------------------------------------------------------
------------------------------ studentized residual ------------------------------
count   1030.00
mean      -0.00
std        1.00
min       -2.83
25%       -0.62
50%       -0.06
75%        0.55
max        4.22
Name: resid_student, dtype: float64

------------------------------ top 5 most negative residuals ------------------------------
cement slag ash water superplastic coarseagg fineagg age strength cooks dffits leverage resid_student
502 500.00 0.00 0.00 200.00 0.00 1125.00 613.00 1.00 12.64 0.01 -0.35 0.02 -2.83
503 362.60 189.00 0.00 164.90 11.60 944.70 755.80 7.00 22.90 0.01 -0.23 0.01 -2.83
786 446.00 24.00 79.00 162.00 11.60 967.00 712.00 3.00 23.35 0.01 -0.22 0.01 -2.69
504 375.00 93.80 0.00 126.60 23.40 852.10 773.58 3.00 29.00 0.03 -0.55 0.05 -2.50
993 446.00 24.00 79.00 162.00 11.60 967.00 712.00 3.00 25.02 0.00 -0.20 0.01 -2.50

------------------------------ top 5 most positive residuals ------------------------------
cement slag ash water superplastic coarseagg fineagg age strength cooks dffits leverage resid_student
192 315.00 137.00 0.00 145.00 5.90 1130.00 745.00 28.00 81.75 0.02 0.44 0.01 4.22
506 451.00 0.00 0.00 165.00 11.30 1030.00 745.00 28.00 78.80 0.01 0.32 0.01 3.63
491 275.00 180.00 120.00 162.00 10.40 830.00 765.00 28.00 76.24 0.01 0.33 0.01 3.14
713 190.00 190.00 0.00 228.00 0.00 932.00 670.00 45.66 53.69 0.01 0.27 0.01 3.00
964 277.20 97.80 24.50 160.70 11.20 1061.70 782.50 28.00 63.14 0.00 0.21 0.01 2.86

We should pay attention to residuals that exceed +2 or -2, and get even more concerned about residuals that exceed +2.5 or -2.5 and even yet more concerned about residuals that exceed +3 or -3.

In [21]:
print('Printing indexes where studentized residual exceeds +2 or -2'); print('--'*60)
res_index = concrete_im_res[abs(r) > 2].index
print(res_index)
Printing indexes where studentized residual exceeds +2 or -2
------------------------------------------------------------------------------------------------------------------------
Int64Index([   3,   44,   50,   96,  103,  128,  147,  159,  161,  192,  198,
             207,  262,  264,  272,  302,  329,  334,  349,  370,  383,  393,
             434,  452,  469,  491,  502,  503,  504,  506,  510,  518,  525,
             530,  539,  545,  556,  570,  606,  623,  632,  713,  732,  734,
             738,  762,  786,  824,  831,  902,  908,  964,  967,  973,  981,
             993,  995, 1003, 1009, 1021, 1028],
           dtype='int64')
In [22]:
print('Let\'s look at leverage points to identify observations that will have potential great influence on reg coefficient estimates.'); print('--'*60)
print('A point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors and n is the number of observations. In our example this works out to (2*8+2)/1030 = .017476')

leverage = concrete_im_res.leverage
print('-'*30 + ' Leverage ' + '-'*30)
display(leverage.describe())
print('\n')

leverage_sort = concrete_im_res.sort_values(by = 'leverage', ascending = False)

print('-'*30 + ' top 5 highest leverage data points ' + '-'*30)
display(leverage_sort.head())
Let's look at leverage points to identify observations that will have potential great influence on reg coefficient estimates.
------------------------------------------------------------------------------------------------------------------------
A point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors and n is the number of observations. In our example this works out to (2*8+2)/1030 = .017476
------------------------------ Leverage ------------------------------
count   1030.00
mean       0.01
std        0.01
min        0.00
25%        0.01
50%        0.01
75%        0.01
max        0.05
Name: leverage, dtype: float64

------------------------------ top 5 highest leverage data points ------------------------------
cement slag ash water superplastic coarseagg fineagg age strength cooks dffits leverage resid_student
129 375.00 93.80 0.00 126.60 23.40 852.10 773.58 91.00 62.50 0.02 -0.43 0.05 -1.86
584 375.00 93.80 0.00 126.60 23.40 852.10 773.58 56.00 60.20 0.00 -0.17 0.05 -0.79
504 375.00 93.80 0.00 126.60 23.40 852.10 773.58 3.00 29.00 0.03 -0.55 0.05 -2.50
447 375.00 93.80 0.00 126.60 23.40 852.10 773.58 7.00 45.70 0.00 -0.14 0.05 -0.65
857 375.00 93.80 0.00 126.60 23.40 852.10 773.58 28.00 56.70 0.00 -0.03 0.05 -0.13
In [23]:
print('Printing indexes where leverage exceeds +0.017476 or -0.017476'); print('--'*60)
lev_index = concrete_im_res[abs(leverage) > 0.017476].index
print(lev_index)
Printing indexes where leverage exceeds +0.017476 or -0.017476
------------------------------------------------------------------------------------------------------------------------
Int64Index([  21,   44,   63,   66,   95,  129,  156,  212,  232,  234,  263,
             292,  300,  307,  447,  452,  469,  490,  504,  538,  540,  553,
             556,  584,  608,  614,  615,  740,  741,  744,  788,  816,  817,
             826,  838,  846,  857,  869,  889,  902,  908,  918,  950,  955,
             973,  990, 1000, 1026],
           dtype='int64')
In [24]:
print('Let\'s take a look at DFITS. The conventional cut-off point for DFITS is 2*sqrt(k/n).')
print('DFITS can be either positive or negative, with numbers close to zero corresponding to the points with small or zero influence.'); print('--'*60)

import math
dffits_index = concrete_im_res[concrete_im_res['dffits'] > 2 * math.sqrt(8 / 1030)].index
print(dffits_index)
Let's take a look at DFITS. The conventional cut-off point for DFITS is 2*sqrt(k/n).
DFITS can be either positive or negative, with numbers close to zero corresponding to the points with small or zero influence.
------------------------------------------------------------------------------------------------------------------------
Int64Index([   3,   50,   86,  103,  128,  147,  159,  161,  192,  198,  207,
             262,  273,  302,  313,  320,  323,  329,  349,  370,  393,  452,
             469,  491,  506,  539,  545,  570,  593,  608,  623,  632,  713,
             732,  824,  918,  935,  964,  995, 1003, 1017, 1028],
           dtype='int64')
In [25]:
set(res_index).intersection(lev_index).intersection(dffits_index)
Out[25]:
{452, 469}
In [26]:
print('Let\'s run the regression again without 452 and 469 row'); print('--'*60)
concrete_im.drop([452, 469], axis = 0, inplace = True)
print(concrete_im.shape)

lm1 = smf.ols(formula = 'strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age', data = concrete_im).fit()
print(lm1.summary())
Let's run the regression again without 452 and 469 row
------------------------------------------------------------------------------------------------------------------------
(1028, 9)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               strength   R-squared:                       0.742
Model:                            OLS   Adj. R-squared:                  0.740
Method:                 Least Squares   F-statistic:                     366.0
Date:                Sat, 05 Dec 2020   Prob (F-statistic):          2.05e-293
Time:                        01:32:09   Log-Likelihood:                -3652.6
No. Observations:                1028   AIC:                             7323.
Df Residuals:                    1019   BIC:                             7368.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       31.8243     18.589      1.712      0.087      -4.652      68.301
cement           0.1011      0.006     16.501      0.000       0.089       0.113
slag             0.0740      0.007     10.153      0.000       0.060       0.088
ash              0.0425      0.009      4.712      0.000       0.025       0.060
water           -0.1713      0.030     -5.707      0.000      -0.230      -0.112
superplastic     0.3039      0.085      3.574      0.000       0.137       0.471
coarseagg       -0.0038      0.007     -0.580      0.562      -0.017       0.009
fineagg         -0.0128      0.008     -1.682      0.093      -0.028       0.002
age              0.3200      0.009     33.684      0.000       0.301       0.339
==============================================================================
Omnibus:                       30.854   Durbin-Watson:                   1.938
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               36.985
Skew:                           0.353   Prob(JB):                     9.31e-09
Kurtosis:                       3.604   Cond. No.                     9.07e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.07e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [27]:
# Correlation matrix
correlation_matrix(concrete_im, 0.8)
Series([], dtype: float64)
In [28]:
# Absolute correlation of independent variables with the target variable
absCorrwithDep = []
allVars = concrete_im.drop('strength', axis = 1).columns

for var in allVars:
    absCorrwithDep.append(abs(concrete_im['strength'].corr(concrete_im[var])))

display(pd.DataFrame([allVars, absCorrwithDep], index = ['Variable', 'Correlation']).T.\
        sort_values('Correlation', ascending = False))
Variable Correlation
7 age 0.52
0 cement 0.49
4 superplastic 0.35
3 water 0.30
6 fineagg 0.19
5 coarseagg 0.17
1 slag 0.14
2 ash 0.10

Observation 6 - Correlation Matrix

  • None of the columns have a correlation above a threshold and thus none to be dropped.
  • age, cement and superplastic are some of the columns that have strong influence over target variable.

Feature Engineering

Performing feature engineering on the cement dataset. Objective here would be:

  • Explore for gaussians. If data is likely to be a mix of gaussians, explore individual clusters and present your findings in terms of the independent attributes and their suitability to predict strength
  • Identify opportunities (if any) to create a composite feature, drop a feature
  • Decide on complexity of the model, should it be simple linear mode in terms of parameters or would a quadratic or higher degree help

Identifying if there are any clusters

In [29]:
concrete_im.reset_index(inplace = True, drop = True)

X = concrete_im.drop('strength', axis = 1)
y = concrete_im['strength']
labels = KMeans(2, random_state = random_state).fit_predict(X)
In [30]:
print('Cement vs other columns clusters'); print('--'*60)
kmeans_plots(X, 'cement')
Cement vs other columns clusters
------------------------------------------------------------------------------------------------------------------------
In [31]:
print('Slag vs other columns clusters'); print('--'*60)
kmeans_plots(X, 'slag')
Slag vs other columns clusters
------------------------------------------------------------------------------------------------------------------------
In [32]:
print('Ash vs other columns clusters'); print('--'*60)
kmeans_plots(X, 'ash')
Ash vs other columns clusters
------------------------------------------------------------------------------------------------------------------------
In [33]:
print('Water vs other columns clusters'); print('--'*60)
kmeans_plots(X, 'water')
Water vs other columns clusters
------------------------------------------------------------------------------------------------------------------------
In [34]:
print('Superplastic vs other columns clusters'); print('--'*60)
kmeans_plots(X, 'superplastic')
Superplastic vs other columns clusters
------------------------------------------------------------------------------------------------------------------------
In [35]:
print('Coarseagg vs other columns clusters'); print('--'*60)
kmeans_plots(X, 'coarseagg')
Coarseagg vs other columns clusters
------------------------------------------------------------------------------------------------------------------------
In [36]:
print('Fineagg vs other columns clusters'); print('--'*60)
kmeans_plots(X, 'fineagg')
Fineagg vs other columns clusters
------------------------------------------------------------------------------------------------------------------------
In [37]:
print('Fineagg vs other columns clusters'); print('--'*60)
kmeans_plots(X, 'age')
Fineagg vs other columns clusters
------------------------------------------------------------------------------------------------------------------------

Observation 7 - Exploring clusters

  • Clusters can be observed with between cement and rest of the independent variables.
  • Cluster at age 100 can be seen.

Adding features based on cluster analysis

Let's add features based on cluster analysis we found for cement and other columns.

In [38]:
# Adding features based on cement clusters
print('Let\'s add features based on cluster analysis we found for cement and other columns'); print('--'*60)

concrete_im = concrete_im.join(pd.DataFrame(labels, columns = ['labels']), how = 'left')
cement_features = concrete_im.groupby('labels', as_index = False)['cement'].agg(['mean', 'median'])
concrete_im = concrete_im.merge(cement_features, on = 'labels', how = 'left')
concrete_im.rename(columns = {'mean': 'cement_labels_mean', 'median': 'cement_labels_median'}, inplace = True)
concrete_im.drop('labels', axis = 1, inplace = True)
display(custom_describe(concrete_im))
Let's add features based on cluster analysis we found for cement and other columns
------------------------------------------------------------------------------------------------------------------------
Count Type Mean StandardDeviation Variance Minimum Q1 Median Q3 Maximum Range IQR Kurtosis Skewness MeanAbsoluteDeviation SkewnessComment OutliersComment
cement 1028 float64 280.74 104.14 10845.73 102.00 192.00 272.80 350.00 540.00 438.00 158.00 -0.51 0.51 86.49 Moderately Skewed (Right) NoOutliers
slag 1028 float64 73.48 85.38 7289.08 0.00 0.00 22.00 142.80 342.10 342.10 142.80 -0.63 0.77 76.32 Moderately Skewed (Right) NoOutliers
ash 1028 float64 54.29 64.01 4097.86 0.00 0.00 0.00 118.30 200.10 200.10 118.30 -1.33 0.53 60.44 Moderately Skewed (Right) NoOutliers
water 1028 float64 181.69 20.56 422.74 126.60 164.90 185.00 192.00 228.00 101.40 27.10 -0.03 0.09 16.35 Fairly Symmetrical (Left) NoOutliers
superplastic 1028 float64 5.98 5.48 29.99 0.00 0.00 6.20 10.10 23.40 23.40 10.10 -0.44 0.47 4.69 Fairly Symmetrical (Left) NoOutliers
coarseagg 1028 float64 972.85 77.66 6030.62 801.00 932.00 968.00 1029.40 1145.00 344.00 97.40 -0.60 -0.04 62.70 Fairly Symmetrical (Right) NoOutliers
fineagg 1028 float64 772.37 78.68 6190.43 594.00 730.30 778.50 822.20 945.00 351.00 91.90 -0.19 -0.33 60.90 Fairly Symmetrical (Left) NoOutliers
age 1028 float64 33.28 27.98 782.75 1.00 7.00 28.00 45.66 120.00 119.00 38.66 0.53 1.16 21.13 Highly Skewed (Right) HasOutliers
strength 1028 float64 35.74 16.64 276.85 2.33 23.70 34.34 45.91 82.60 80.27 22.21 -0.31 0.41 13.41 Fairly Symmetrical (Right) HasOutliers
cement_labels_mean 1028 float64 280.74 85.39 7292.30 202.25 202.25 202.25 373.55 373.55 171.30 171.30 -1.98 0.17 85.05 Fairly Symmetrical (Right) NoOutliers
cement_labels_median 1028 float64 274.50 81.05 6569.82 200.00 200.00 200.00 362.60 362.60 162.60 162.60 -1.98 0.17 80.73 Fairly Symmetrical (Right) NoOutliers
None

Identifying important feature interaction

Check whether there exist any important feature interaction which we can make use of to create new features.

In [39]:
# Splitting the dataset into train and test set for checking feature interaction
print('Checking if there exist any important feature interaction and make use of that to create features')
print('Make use catboostregressor\'s feature interaction'); print('--'*60)

X = concrete_im.drop('strength', axis = 1)
y = concrete_im['strength']
features_list = list(X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state)

# Initialize CatBoostRegressor
reg = CatBoostRegressor(iterations = None, eval_metric = 'RMSE', random_state = random_state,  od_type = 'Iter', od_wait = 5)
reg.fit(X_train, y_train, early_stopping_rounds = 5, verbose = False, eval_set = [(X_test, y_test)], use_best_model = True)
Checking if there exist any important feature interaction and make use of that to create features
Make use catboostregressor's feature interaction
------------------------------------------------------------------------------------------------------------------------
Out[39]:
<catboost.core.CatBoostRegressor at 0x7fb5843b7250>
In [41]:
# Get feature importance -- Type = Interaction
print('Feature Importance plot for catboostregressor using type = Interaction'); 
print('Adding features based on cement and age; water and age can be useful'); print('--'*60)
FI = reg.get_feature_importance(Pool(X_test, label = y_test), type = 'Interaction')
FI_new = []
for k, item in enumerate(FI):  
    first = X_test.dtypes.index[FI[k][0]]
    second = X_test.dtypes.index[FI[k][1]]
    if first != second:
        FI_new.append([first + "_" + second, FI[k][2]])
feature_score = pd.DataFrame(FI_new, columns = ['FeaturePair', 'Score'])
feature_score = feature_score.sort_values(by = 'Score', ascending = True)
ax = feature_score.plot('FeaturePair', 'Score', kind = 'barh', figsize = (15, 10))
ax.set_title('Pairwise Feature Importance', fontsize = 14)
ax.set_xlabel('Features Pair')
plt.show()
Feature Importance plot for catboostregressor using type = Interaction
Adding features based on cement and age; water and age can be useful
------------------------------------------------------------------------------------------------------------------------

Adding features based on feature interaction

Let's add features based on cement and age; water and age interaction.

In [42]:
# Adding features based on 'feature interaction' we got from above catboostregressor
print('Adding features based on feature interaction we got from catboostregressor\'s feature importance'); print('--'*60)

cement_age = concrete_im.groupby('age', as_index = False)['cement'].agg(['mean', 'median'])
concrete_im = concrete_im.merge(cement_age, on = 'age', how = 'left')
concrete_im.rename(columns = {'mean': 'cement_age_mean', 'median': 'cement_age_median'}, inplace = True)

water_age = concrete_im.groupby('age')['water'].agg(['mean', 'median']); concrete_im = concrete_im.merge(water_age, on = 'age', how = 'left')
concrete_im.rename(columns = {'mean': 'water_age_mean', 'median': 'water_age_median'}, inplace = True)
concrete_im.describe()
Adding features based on feature interaction we got from catboostregressor's feature importance
------------------------------------------------------------------------------------------------------------------------
Out[42]:
cement slag ash water superplastic coarseagg fineagg age strength cement_labels_mean cement_labels_median cement_age_mean cement_age_median water_age_mean water_age_median
count 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00 1028.00
mean 280.74 73.48 54.29 181.69 5.98 972.85 772.37 33.28 35.74 280.74 274.50 280.74 270.42 181.69 183.12
std 104.14 85.38 64.01 20.56 5.48 77.66 78.68 27.98 16.64 85.39 81.05 31.43 35.00 10.52 13.74
min 102.00 0.00 0.00 126.60 0.00 801.00 594.00 1.00 2.33 202.25 200.00 220.91 213.75 157.76 154.80
25% 192.00 0.00 0.00 164.90 0.00 932.00 730.30 7.00 23.70 202.25 200.00 264.32 254.50 176.63 178.50
50% 272.80 22.00 0.00 185.00 6.20 968.00 778.50 28.00 34.34 202.25 200.00 264.32 260.90 182.81 185.00
75% 350.00 142.80 118.30 192.00 10.10 1029.40 822.20 45.66 45.91 373.55 362.60 294.17 288.50 182.81 185.00
max 540.00 342.10 200.10 228.00 23.40 1145.00 945.00 120.00 82.60 373.55 362.60 442.50 442.50 210.80 228.00

Feature Importance

Let's use model based feature importance, eli5, correlation matrix and absolute correlation to understand important features.

In [44]:
X = concrete_im.drop('strength', axis = 1)
y = concrete_im['strength']
features_list = list(X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state)

reg = CatBoostRegressor(iterations = None, eval_metric = 'RMSE', random_state = random_state,  od_type = 'Iter', od_wait = 5)
reg.fit(X_train, y_train, early_stopping_rounds = 5, verbose = False, eval_set = [(X_test, y_test)], use_best_model = True)
Out[44]:
<catboost.core.CatBoostRegressor at 0x7fb567f0fd90>
In [45]:
# Get feature importance -- eli5
perm = PermutationImportance(reg, random_state = random_state).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())
Out[45]:
Weight Feature
0.4503 ± 0.1613 age
0.3568 ± 0.0879 cement
0.2047 ± 0.0572 water
0.1806 ± 0.0591 slag
0.0580 ± 0.0132 superplastic
0.0453 ± 0.0162 fineagg
0.0234 ± 0.0067 coarseagg
0.0155 ± 0.0077 cement_labels_median
0.0130 ± 0.0053 cement_age_mean
0.0120 ± 0.0034 water_age_median
0.0119 ± 0.0042 water_age_mean
0.0077 ± 0.0015 cement_labels_mean
0.0049 ± 0.0045 ash
0.0048 ± 0.0091 cement_age_median
In [46]:
# Get feature importance -- model based
print('Feature Importance plot for catboostregressor using type = PredictionValuesChange'); 
print('Age, cement and water are top 3 importance features'); print('--'*60)
FI = reg.get_feature_importance(Pool(X_test, label = y_test), type = 'PredictionValuesChange')
feature_score = pd.DataFrame(list(zip(X_test.dtypes.index, FI)), columns = ['Feature', 'Score'])
feature_score = feature_score.sort_values(by = 'Score', ascending = True)
ax = feature_score.plot('Feature', 'Score', kind = 'barh', figsize = (15, 10))
ax.set_title('Feature Importance', fontsize = 14)
ax.set_xlabel('Features')
plt.show()
Feature Importance plot for catboostregressor using type = PredictionValuesChange
Age, cement and water are top 3 importance features
------------------------------------------------------------------------------------------------------------------------
In [47]:
# Correlation matrix
correlation_matrix(concrete_im, 0.9)
cement_labels_mean    cement_labels_median   1.00
cement_labels_median  cement_labels_mean     1.00
water_age_median      water_age_mean         0.95
water_age_mean        water_age_median       0.95
cement_age_median     cement_age_mean        0.91
cement_age_mean       cement_age_median      0.91
dtype: float64
In [48]:
# Absolute correlation of independent variables with the target variable
absCorrwithDep = []
allVars = concrete_im.drop('strength', axis = 1).columns

for var in allVars:
    absCorrwithDep.append(abs(concrete_im['strength'].corr(concrete_im[var])))

display(pd.DataFrame([allVars, absCorrwithDep], index = ['Variable', 'Correlation']).T.\
        sort_values('Correlation', ascending = False))
Variable Correlation
7 age 0.52
0 cement 0.49
9 cement_labels_median 0.40
8 cement_labels_mean 0.40
4 superplastic 0.35
3 water 0.30
6 fineagg 0.19
5 coarseagg 0.17
1 slag 0.14
2 ash 0.10
13 water_age_median 0.09
10 cement_age_mean 0.08
11 cement_age_median 0.07
12 water_age_mean 0.07
In [49]:
print('Checking if multicollinearity exists')
print('A VIF between 5 and 10 indicates high correlation that may be problematic. And if the VIF goes above 10, you can assume that the regression coefficients are poorly estimated due to multicollinearity.')
print('--'*60)

y, X = dmatrices('strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age + cement_labels_mean + cement_labels_median + cement_age_mean + cement_age_median + water_age_mean + water_age_median', 
                 concrete_im, return_type = 'dataframe')
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['Features'] = X.columns
display(vif.round(1).sort_values(by = 'VIF Factor', ascending = False))
Checking if multicollinearity exists
A VIF between 5 and 10 indicates high correlation that may be problematic. And if the VIF goes above 10, you can assume that the regression coefficients are poorly estimated due to multicollinearity.
------------------------------------------------------------------------------------------------------------------------
VIF Factor Features
9 inf cement_labels_mean
10 inf cement_labels_median
13 15.90 water_age_mean
14 15.00 water_age_median
12 9.10 cement_age_median
1 7.80 cement
11 7.80 cement_age_mean
2 5.90 slag
7 5.60 fineagg
3 5.50 ash
4 5.50 water
6 4.00 coarseagg
5 3.20 superplastic
8 1.30 age
0 0.00 Intercept

Observation 8 - Feature Engineering

  • In the feature engineering steps we identified that there clusters between cement and rest of the independent features. Used that to add features based on cement and cluster labels. Features were added for min, max, mean and median values for cement-labels group.
  • Further, we also made use of CatBoostRegressor's feature interaction to add features based on cement and age; water and age interaction.
  • While doing this steps, we added 6 new features. It was important to check feature importance and correlation matrix at this stage.
  • age, cement, water, slag are some of the importance features based on eli5 and model based feature importance. Dropping all newly added features since they resulted in multicollinearity :(.
In [50]:
concrete_im.drop(['water_age_mean', 'water_age_median', 'cement_age_mean', 'cement_labels_mean', 'cement_labels_median', 'cement_age_mean'], axis = 1, inplace = True)
concrete_im.shape, concrete_im.columns
Out[50]:
((1028, 10),
 Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
        'fineagg', 'age', 'strength', 'cement_age_median'],
       dtype='object'))

Model Complexity

Decide on complexity of the model, should it be simple linear mode in terms of parameters or would a quadratic or higher degree help

In [51]:
print('Split into training (70%), validation(10%) and test(20%) sets for both with EDA and FE & without EDA and FE.')
print('--'*60)

# Training, validation and test sets with outliers
X = concrete.drop('strength', axis = 1); y = concrete['strength']; features_list = list(X.columns)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.12, random_state = random_state)
print(f'Shape of train, valid and test datasets without EDA, FE: {(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)}')
print(f'Proportion in the splits for train, valid, test datasets without EDA, FE: {round(len(X_train)/len(X), 2), round(len(X_val)/len(X), 2), round(len(X_test)/len(X), 2)}')

# Training, validation and test sets without outliers
X = concrete_im.drop('strength', axis = 1); y = concrete_im['strength']; features_list = list(X.columns)
X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(X, y, test_size = 0.2, random_state = random_state)
X_train_fe, X_val_fe, y_train_fe, y_val_fe = train_test_split(X_train_fe, y_train_fe, test_size = 0.12, random_state = random_state)
print(f'\nShape of train, valid and test datasets with EDA, FE: {(X_train_fe.shape, y_train_fe.shape, X_val_fe.shape, y_val_fe.shape, X_test_fe.shape, y_test_fe.shape)}')
print(f'Proportion in the splits for train, valid, test datasets with EDA, FE: {round(len(X_train_fe)/len(X), 2), round(len(X_val_fe)/len(X), 2), round(len(X_test_fe)/len(X), 2)}')

training_test_sets = {'withoutedafe': (X_train, y_train, X_val, y_val), 'withedafe': (X_train_fe, y_train_fe, X_val_fe, y_val_fe)}
Split into training (70%), validation(10%) and test(20%) sets for both with EDA and FE & without EDA and FE.
------------------------------------------------------------------------------------------------------------------------
Shape of train, valid and test datasets without EDA, FE: ((725, 8), (725,), (99, 8), (99,), (206, 8), (206,))
Proportion in the splits for train, valid, test datasets without EDA, FE: (0.7, 0.1, 0.2)

Shape of train, valid and test datasets with EDA, FE: ((723, 9), (723,), (99, 9), (99,), (206, 9), (206,))
Proportion in the splits for train, valid, test datasets with EDA, FE: (0.7, 0.1, 0.2)

Check the improvement from dataset we had, also decide on complexity

In [52]:
print('Let\'s check cross validated scores on linear models and tree-based models on training and validation sets with and without EDA & FE')
print('--'*60)
models = []
models.append(('Linear', LinearRegression()))
models.append(('Lasso', Lasso(random_state = random_state)))
models.append(('Ridge', Ridge(random_state = random_state)))
models.append(('SVR', SVR()))
models.append(('DecisionTree', DecisionTreeRegressor(random_state = random_state)))
models.append(('GradientBoost', GradientBoostingRegressor(random_state = random_state)))
models.append(('AdaBoost', AdaBoostRegressor(random_state = random_state)))
models.append(('ExtraTrees', ExtraTreesRegressor(random_state = random_state)))
models.append(('RandomForest', RandomForestRegressor(random_state = random_state)))
models.append(('Bagging', BaggingRegressor(DecisionTreeRegressor(random_state = random_state), random_state = random_state)))
models.append(('CatBoost', CatBoostRegressor(random_state = random_state, silent = True)))

scoring = 'r2'; results = {}; score = {}

for encoding_label, (_X_train, _y_train, _X_val, _y_val) in training_test_sets.items():
  scores = []; result_cv = []; names = []
  for name, model in models:
    kf = KFold(n_splits = 10, random_state = random_state)
    cv_results = cross_val_score(model, _X_train, _y_train, cv = kf, scoring = scoring)
    result_cv.append(cv_results); names.append(name)
    scores.append([name, cv_results.mean().round(4), cv_results.std().round(4)])
  score[encoding_label] = scores
  results[encoding_label] = [names, result_cv]

print('Let\'s check the cv scores (r2) for sets without EDA and FE')
display(score['withoutedafe'])

print('\nLet\'s check the cv scores (r2) for sets with EDA and FE')
display(score['withedafe'])
Let's check cross validated scores on linear models and tree-based models on training and validation sets with and without EDA & FE
------------------------------------------------------------------------------------------------------------------------
Let's check the cv scores (r2) for sets without EDA and FE
[['Linear', 0.6111, 0.0585],
 ['Lasso', 0.6104, 0.0605],
 ['Ridge', 0.6111, 0.0585],
 ['SVR', 0.2173, 0.0372],
 ['DecisionTree', 0.7869, 0.0707],
 ['GradientBoost', 0.8951, 0.0365],
 ['AdaBoost', 0.7876, 0.036],
 ['ExtraTrees', 0.9072, 0.0376],
 ['RandomForest', 0.8944, 0.0315],
 ['Bagging', 0.8747, 0.0384],
 ['CatBoost', 0.9346, 0.0249]]
Let's check the cv scores (r2) for sets with EDA and FE
[['Linear', 0.7427, 0.0436],
 ['Lasso', 0.7427, 0.0435],
 ['Ridge', 0.7427, 0.0436],
 ['SVR', 0.2227, 0.04],
 ['DecisionTree', 0.7552, 0.1062],
 ['GradientBoost', 0.9064, 0.0262],
 ['AdaBoost', 0.7864, 0.0258],
 ['ExtraTrees', 0.9097, 0.0233],
 ['RandomForest', 0.8994, 0.0278],
 ['Bagging', 0.8881, 0.024],
 ['CatBoost', 0.935, 0.0206]]
In [53]:
pd.options.display.float_format = "{:.4f}".format

scores_df = pd.concat([pd.DataFrame(score['withoutedafe'], columns = ['Model', 'R2 (Mean) Without', 'R2 (Std) Without']).set_index('Model'), 
           pd.DataFrame(score['withedafe'], columns = ['Model', 'R2 (Mean) With', 'R2 (Std) With']).set_index('Model')], axis = 1)
scores_df['Improvement?'] = scores_df['R2 (Mean) With'] - scores_df['R2 (Mean) Without']
display(scores_df)
R2 (Mean) Without R2 (Std) Without R2 (Mean) With R2 (Std) With Improvement?
Model
Linear 0.6111 0.0585 0.7427 0.0436 0.1316
Lasso 0.6104 0.0605 0.7427 0.0435 0.1323
Ridge 0.6111 0.0585 0.7427 0.0436 0.1316
SVR 0.2173 0.0372 0.2227 0.0400 0.0054
DecisionTree 0.7869 0.0707 0.7552 0.1062 -0.0317
GradientBoost 0.8951 0.0365 0.9064 0.0262 0.0113
AdaBoost 0.7876 0.0360 0.7864 0.0258 -0.0012
ExtraTrees 0.9072 0.0376 0.9097 0.0233 0.0025
RandomForest 0.8944 0.0315 0.8994 0.0278 0.0050
Bagging 0.8747 0.0384 0.8881 0.0240 0.0134
CatBoost 0.9346 0.0249 0.9350 0.0206 0.0004
In [54]:
print('A significant improvement in r2 scores after EDA & FE for linear algorithms whereas remains almost same for tree-based algorithms.'); print('--'*60)

fig,(ax1, ax2) = plt.subplots(1, 2, figsize = (20, 7.2))
ax1.boxplot(results['withoutedafe'][1]); ax1.set_xticklabels(results['withoutedafe'][0], rotation = 90); ax1.set_title('CV Score - without EDA and FE')
ax2.boxplot(results['withedafe'][1]); ax2.set_xticklabels(results['withedafe'][0], rotation = 90); ax2.set_title('CV Score - with EDA and FE')
plt.show()
A significant improvement in r2 scores after EDA & FE for linear algorithms whereas remains almost same for tree-based algorithms.
------------------------------------------------------------------------------------------------------------------------

Observation 9 - Model Complexity

  • We see an improvement in the scores against the uncleaned data we had. Improvements are clearly seen for linear algos whereas for tree-based it either marginally increases/decreases.
  • Tree-based algorithms are a clear choice when it comes to linear vs tree-based comparison.
  • CatboostRegressor gives us the highest r2 score.

Scale or not scale?

In [55]:
scalers = {'notscaled': None, 'standardscaling': StandardScaler(), 'robustscaling': RobustScaler()}

training_test_sets = {'validation_sets': (X_train_fe, y_train_fe, X_val_fe, y_val_fe),
                      'test_sets': (X_train_fe, y_train_fe, X_test_fe, y_test_fe)}

# initialize model
cat_reg = CatBoostRegressor(iterations = None, eval_metric = 'RMSE', random_state = random_state, od_type = 'Iter', od_wait = 5)

# iterate over all possible combinations and get the errors
errors = {}
for encoding_label, (_X_train, _y_train, _X_val, _y_val) in training_test_sets.items():
    for scaler_label, scaler in scalers.items():
        scores = []
        if scaler == None:
          trainingset = _X_train.copy()
          testset = _X_val.copy()
          cat_reg.fit(trainingset, _y_train, early_stopping_rounds = 5, verbose = False, plot = False,
                      eval_set = [(testset, _y_val)], use_best_model = True)
          pred = cat_reg.predict(testset)
          rmse = rmse_score(_y_val, pred)
          r2 = r2_score(_y_val, pred)
          scores.append([rmse, r2])
          key = encoding_label + ' - ' + scaler_label
          errors[key] = scores[0]
        else:
          trainingset = _X_train.copy()
          testset = _X_val.copy()
          trainingset = scaler.fit_transform(trainingset)
          testset = scaler.transform(testset)
          cat_reg.fit(trainingset, _y_train, early_stopping_rounds = 5, verbose = False, plot = False,
                      eval_set = [(testset, _y_val)], use_best_model = True)
          pred = cat_reg.predict(testset)
          rmse = rmse_score(_y_val, pred)
          r2 = r2_score(_y_val, pred)
          scores.append([rmse, r2])
          key = encoding_label + ' - ' + scaler_label
          errors[key] = scores[0]
In [56]:
print('It can be seen that RMSE is lowest when robust scaling is used whereas R2 almost remains same as un-scaled data.'); 
print('Scaling would help to effectively use the training and validation sets across algorithms.');print('--'*60)

display(errors)
It can be seen that RMSE is lowest when robust scaling is used whereas R2 almost remains same as un-scaled data.
Scaling would help to effectively use the training and validation sets across algorithms.
------------------------------------------------------------------------------------------------------------------------
{'validation_sets - notscaled': [3.465025537432223, 0.9551300397202107],
 'validation_sets - standardscaling': [3.464514520117853, 0.955143273467297],
 'validation_sets - robustscaling': [3.464514520117853, 0.955143273467297],
 'test_sets - notscaled': [4.872582428665989, 0.9084817539833094],
 'test_sets - standardscaling': [4.872582428665989, 0.9084817539833094],
 'test_sets - robustscaling': [4.871007316946747, 0.908540912823503]}

Observation 10 - Yes, scale please

  • It can be seen that RMSE is lowest when robust scaling is used, whereas R2 score remains almost same as unscaled data. Additionally, scaling would also help to effectively use the training and validation sets across algos.

Modelling

  • Choose an algorithm. Here we would choose/evaluate, 3 linears and 2 decision tree based + AdaBoost, GradientBoost and ExtraTrees regressors. +Bonus: CatBoostRegressor.
  • Employ hyperparameter tuning techniques to squeeze that extra performance out of the model without making it overfit or underfit
  • Model performance range at 95% confidence level
In [57]:
## Helper function to train, validate and predict
def train_val_predict(basemodel, train_X, train_y, test_X, test_y, name, model):

  folds = list(KFold(n_splits = 5, random_state = random_state, shuffle = True).split(train_X, train_y))
  
  r2_scores_train = []; r2_scores_val = []; r2_scores_test = []

  for j, (train_index, val_index) in enumerate(folds):
    X_train = train_X.iloc[train_index]
    y_train = train_y.iloc[train_index]
    X_val = train_X.iloc[val_index]
    y_val = train_y.iloc[val_index]

    if model == 'CatBoost':
      basemodel.fit(X_train, y_train, early_stopping_rounds = 5, verbose = 300, eval_set = [(X_val, y_val)], use_best_model = True)
    else:
      basemodel.fit(X_train, y_train)

    pred = basemodel.predict(X_train)
    r2 = r2_score(y_train, pred); r2_scores_train.append(r2)
    
    pred = basemodel.predict(X_val)
    r2 = r2_score(y_val, pred); r2_scores_val.append(r2)

    pred = basemodel.predict(X_test_fe)
    r2 = r2_score(y_test_fe, pred); r2_scores_test.append(r2)

  df = pd.DataFrame([np.mean(r2_scores_train), np.mean(r2_scores_val), np.mean(r2_scores_test)],
                    index = ['r2 Scores Train', 'r2 Scores Val', 'r2 Scores Test'], 
                    columns = [name]).T
  return df
In [58]:
print('Separating the dependents and independents + Scaling the data'); print('--'*60)
features_list = list(concrete_im.columns)
concrete_im = concrete_im.apply(zscore); concrete_im = pd.DataFrame(concrete_im , columns = features_list)
display(concrete_im.describe())

X = concrete_im.drop('strength', axis = 1); y = concrete_im['strength']; 
X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(X, y, test_size = 0.2, random_state = random_state)

X_train_fe.shape, X_test_fe.shape, y_train_fe.shape, y_test_fe.shape
Separating the dependents and independents + Scaling the data
------------------------------------------------------------------------------------------------------------------------
cement slag ash water superplastic coarseagg fineagg age strength cement_age_median
count 1028.0000 1028.0000 1028.0000 1028.0000 1028.0000 1028.0000 1028.0000 1028.0000 1028.0000 1028.0000
mean -0.0000 0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000
std 1.0005 1.0005 1.0005 1.0005 1.0005 1.0005 1.0005 1.0005 1.0005 1.0005
min -1.7171 -0.8611 -0.8486 -2.6808 -1.0932 -2.2140 -2.2682 -1.1543 -2.0092 -1.6196
25% -0.8525 -0.8611 -0.8486 -0.8171 -1.0932 -0.5263 -0.5350 -0.9397 -0.7244 -0.4549
50% -0.0762 -0.6033 -0.8486 0.1609 0.0404 -0.0625 0.0779 -0.1888 -0.0842 -0.2720
75% 0.6654 0.8123 1.0004 0.5016 0.7521 0.7285 0.6336 0.4428 0.6112 0.5169
max 2.4907 3.1478 2.2788 2.2533 3.1821 2.2179 2.1952 3.1012 2.8174 4.9185
Out[58]:
((822, 9), (206, 9), (822,), (206,))

Linear Regression, Lasso, Ridge

In [59]:
print('Using the 5-Fold Linear Regression to train, validate and predict'); print('--'*60)
lr_reg = LinearRegression()
df_lr = train_val_predict(lr_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold LinearRegression', model = 'LR')
Using the 5-Fold Linear Regression to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
In [60]:
%%time
print('Using the 5-Fold Lasso Regression to train, validate and predict'); print('--'*60)
lasso_reg = Lasso(alpha = 0.01)
df_lasso = train_val_predict(lasso_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold LassoRegression', model = 'Lasso')
df = df_lr.append(df_lasso)
Using the 5-Fold Lasso Regression to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
CPU times: user 78.5 ms, sys: 3.55 ms, total: 82 ms
Wall time: 64.4 ms
In [61]:
%%time
print('Using the 5-Fold Ridge Regression to train, validate and predict'); print('--'*60)
ridge_reg = Ridge(alpha = 0.01)
df_ridge = train_val_predict(ridge_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold RidgeRegression', model = 'Ridge')
df = df.append(df_ridge)
display(df)
Using the 5-Fold Ridge Regression to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
r2 Scores Train r2 Scores Val r2 Scores Test
5-Fold LinearRegression 0.7600 0.7504 0.6671
5-Fold LassoRegression 0.7587 0.7498 0.6653
5-Fold RidgeRegression 0.7600 0.7504 0.6671
CPU times: user 85.4 ms, sys: 4.3 ms, total: 89.7 ms
Wall time: 83.5 ms

Decision Tree and Random Forest

In [62]:
%%time
print('Finding out the hyperparameters for Decision Tree and Random Forest with GridSearchCV'); print('--'*60)
best_params_grid = {}

# Decision Tree and Random Forest Regressor Hyperparameters Grid
param_grid = {'DecisionTree': {'criterion': ['mse', 'mae'], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None]},
              'RandomForest': {'bootstrap': [True, False], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None],
                                'max_features': ['auto', 'sqrt'], 'n_estimators': [200, 400, 600, 800]}}

# Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state = random_state)
dt_reg_grid = GridSearchCV(dt_reg, param_grid['DecisionTree'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
dt_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['DecisionTree'] = dt_reg_grid.best_params_

# Random Forest Regressor
rf_reg = RandomForestRegressor(random_state = random_state)
rf_reg_grid = GridSearchCV(rf_reg, param_grid['RandomForest'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
rf_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['RandomForest'] = rf_reg_grid.best_params_

print(f'Best parameters for Decision Tree and Random Forest using GridSearchCV: {best_params_grid}')
Finding out the hyperparameters for Decision Tree and Random Forest with GridSearchCV
------------------------------------------------------------------------------------------------------------------------
Best parameters for Decision Tree and Random Forest using GridSearchCV: {'DecisionTree': {'criterion': 'mse', 'max_depth': 9}, 'RandomForest': {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'n_estimators': 200}}
CPU times: user 2.73 s, sys: 401 ms, total: 3.13 s
Wall time: 6min 52s
In [70]:
%%time
print('Finding out the hyperparameters for Decision Tree and Random Forest with RandomizedSearchCV'); print('--'*60)
best_params_random = {}

# Decision Tree and Random Forest Regressor Hyperparameters Grid
param_grid = {'DecisionTree': {'criterion': ['mse', 'mae'], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None]},
              'RandomForest': {'bootstrap': [True, False], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None],
                                'max_features': ['auto', 'sqrt'], 'n_estimators': [200, 400, 600, 800]}}

# Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state = random_state)
dt_reg_grid = RandomizedSearchCV(dt_reg, param_grid['DecisionTree'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
dt_reg_grid.fit(X_train_fe, y_train_fe)
best_params_random['DecisionTree'] = dt_reg_grid.best_params_

# Random Forest Regressor
rf_reg = RandomForestRegressor(random_state = random_state)
rf_reg_grid = RandomizedSearchCV(rf_reg, param_grid['RandomForest'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
rf_reg_grid.fit(X_train_fe, y_train_fe)
best_params_random['RandomForest'] = rf_reg_grid.best_params_

print(f'Best parameters for Decision Tree and Random Forest using RandomizedSearchCV: {best_params_random}')
Finding out the hyperparameters for Decision Tree and Random Forest with RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
Best parameters for Decision Tree and Random Forest using RandomizedSearchCV: {'DecisionTree': {'max_depth': 9, 'criterion': 'mse'}, 'RandomForest': {'n_estimators': 200, 'max_features': 'auto', 'max_depth': None, 'bootstrap': True}}
CPU times: user 838 ms, sys: 39.7 ms, total: 878 ms
Wall time: 24.5 s
In [71]:
%%time
print('Using the 5-Fold Decision Tree Regressor to train, validate and predict'); print('--'*60)
dt_reg = DecisionTreeRegressor(random_state = random_state)
df_reg = train_val_predict(dt_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold DecisionTree', model = 'DT')
df = df.append(df_reg)
Using the 5-Fold Decision Tree Regressor to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
CPU times: user 67 ms, sys: 2.77 ms, total: 69.7 ms
Wall time: 175 ms
In [72]:
%%time
print('Using the 5-Fold Decision Tree Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
dt_reg_grid = DecisionTreeRegressor(random_state = random_state, **best_params_grid['DecisionTree'])
df_reg_grid = train_val_predict(dt_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold DecisionTree GridSearchCV', model = 'DT')
df = df.append(df_reg_grid)
Using the 5-Fold Decision Tree Regressor to train, validate and predict using GridSearchCV
------------------------------------------------------------------------------------------------------------------------
CPU times: user 76.2 ms, sys: 2.28 ms, total: 78.5 ms
Wall time: 135 ms
In [73]:
%%time
print('Using the 5-Fold Decision Tree Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
dt_reg_rand = DecisionTreeRegressor(random_state = random_state, **best_params_random['DecisionTree'])
df_reg_rand = train_val_predict(dt_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold DecisionTree RandomizedSearchCV', model = 'DT')
df = df.append(df_reg_rand)
display(df)
Using the 5-Fold Decision Tree Regressor to train, validate and predict using RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
r2 Scores Train r2 Scores Val r2 Scores Test
5-Fold LinearRegression 0.7600 0.7504 0.6671
5-Fold LassoRegression 0.7587 0.7498 0.6653
5-Fold RidgeRegression 0.7600 0.7504 0.6671
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9856 0.8993 0.8722
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
CPU times: user 69.5 ms, sys: 3.96 ms, total: 73.4 ms
Wall time: 118 ms
In [74]:
%%time
print('Using the 5-Fold Random Forest Regressor to train, validate and predict'); print('--'*60)
rf_reg = RandomForestRegressor(random_state = random_state)
df_reg = train_val_predict(rf_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold RandomForest', model = 'RF')
df = df.append(df_reg)
Using the 5-Fold Random Forest Regressor to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
CPU times: user 1.72 s, sys: 46.9 ms, total: 1.77 s
Wall time: 2.26 s
In [75]:
%%time
print('Using the 5-Fold Random Forest Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
rf_reg_grid = RandomForestRegressor(random_state = random_state, **best_params_grid['RandomForest'])
df_reg_grid = train_val_predict(rf_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold RandomForest GridSearchCV', model = 'RF')
df = df.append(df_reg_grid)
Using the 5-Fold Random Forest Regressor to train, validate and predict using GridSearchCV
------------------------------------------------------------------------------------------------------------------------
CPU times: user 2.23 s, sys: 74.4 ms, total: 2.31 s
Wall time: 2.83 s
In [76]:
%%time
print('Using the 5-Fold Random Forest Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
rf_reg_rand = RandomForestRegressor(random_state = random_state, **best_params_random['RandomForest'])
df_reg_rand = train_val_predict(rf_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold RandomForest RandomizedSearchCV', model = 'RF')
df = df.append(df_reg_rand)
display(df)
Using the 5-Fold Random Forest Regressor to train, validate and predict using RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
r2 Scores Train r2 Scores Val r2 Scores Test
5-Fold LinearRegression 0.7600 0.7504 0.6671
5-Fold LassoRegression 0.7587 0.7498 0.6653
5-Fold RidgeRegression 0.7600 0.7504 0.6671
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9856 0.8993 0.8722
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9862 0.9032 0.8776
CPU times: user 3.05 s, sys: 76.7 ms, total: 3.13 s
Wall time: 3.7 s

AdaBoost, GradientBoost and ExtraTrees

In [77]:
%%time
print('Using the 5-Fold Ada Boost Regressor to train, validate and predict'); print('--'*60)
ada_reg = AdaBoostRegressor(random_state = random_state)
df_reg = train_val_predict(ada_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold AdaBoost', model = 'Ada')
df = df.append(df_reg)
Using the 5-Fold Ada Boost Regressor to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
CPU times: user 652 ms, sys: 31 ms, total: 683 ms
Wall time: 979 ms
In [78]:
%%time
# AdaBoost Regressor Hyperparameters Grid
print('Finding out the hyperparameters for AdaBoostRegressor with GridSearchCV'); print('--'*60)

param_grid = {'AdaBoost': {'base_estimator': [DecisionTreeRegressor(random_state = random_state, **best_params_grid['DecisionTree']), None],
                           'n_estimators': [100, 150, 200], 'learning_rate': [0.01, 0.1, 1.0]}}

# AdaBoost Regressor
ada_reg = AdaBoostRegressor(random_state = random_state)
ada_reg_grid = GridSearchCV(ada_reg, param_grid['AdaBoost'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
ada_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['AdaBoost'] = ada_reg_grid.best_params_

print('Best parameters for AdaBoost Regressor using GridSearchCV: {}'.format(best_params_grid['AdaBoost']))
Finding out the hyperparameters for AdaBoostRegressor with GridSearchCV
------------------------------------------------------------------------------------------------------------------------
Best parameters for AdaBoost Regressor using GridSearchCV: {'base_estimator': DecisionTreeRegressor(max_depth=9, random_state=2019), 'learning_rate': 1.0, 'n_estimators': 100}
CPU times: user 514 ms, sys: 40.6 ms, total: 555 ms
Wall time: 19.2 s
In [79]:
%%time
# AdaBoost Regressor Hyperparameters Grid
print('Finding out the hyperparameters for AdaBoostRegressor with RandomizedSearchCV'); print('--'*60)

param_grid = {'AdaBoost': {'base_estimator': [DecisionTreeRegressor(random_state = random_state, **best_params_grid['DecisionTree']), None],
                           'n_estimators': [100, 150, 200], 'learning_rate': [0.01, 0.1, 1.0]}}

# AdaBoost Regressor
ada_reg = AdaBoostRegressor(random_state = random_state)
ada_reg_grid = RandomizedSearchCV(ada_reg, param_grid['AdaBoost'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
ada_reg_grid.fit(X_train_fe, y_train_fe)
best_params_random['AdaBoost'] = ada_reg_grid.best_params_

print('Best parameters for AdaBoost Regressor using RandomizedSearchCV: {}'.format(best_params_random['AdaBoost']))
Finding out the hyperparameters for AdaBoostRegressor with RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
Best parameters for AdaBoost Regressor using RandomizedSearchCV: {'n_estimators': 100, 'learning_rate': 1.0, 'base_estimator': DecisionTreeRegressor(max_depth=9, random_state=2019)}
CPU times: user 437 ms, sys: 22.9 ms, total: 460 ms
Wall time: 10.1 s
In [80]:
%%time
print('Using the 5-Fold Ada Boost Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
ada_reg_grid = AdaBoostRegressor(random_state = random_state, **best_params_grid['AdaBoost'])
df_reg_grid = train_val_predict(ada_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold AdaBoost using GridSearchCV', model = 'Ada')
df = df.append(df_reg_grid)
Using the 5-Fold Ada Boost Regressor to train, validate and predict using GridSearchCV
------------------------------------------------------------------------------------------------------------------------
CPU times: user 1.6 s, sys: 43.7 ms, total: 1.65 s
Wall time: 2.21 s
In [81]:
%%time
print('Using the 5-Fold Ada Boost Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
ada_reg_rand = AdaBoostRegressor(random_state = random_state, **best_params_random['AdaBoost'])
df_reg_rand = train_val_predict(ada_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold AdaBoost using RandomizedSearchCV', model = 'Ada')
df = df.append(df_reg_rand)
display(df)
Using the 5-Fold Ada Boost Regressor to train, validate and predict using RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
r2 Scores Train r2 Scores Val r2 Scores Test
5-Fold LinearRegression 0.7600 0.7504 0.6671
5-Fold LassoRegression 0.7587 0.7498 0.6653
5-Fold RidgeRegression 0.7600 0.7504 0.6671
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9856 0.8993 0.8722
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9862 0.9032 0.8776
5-Fold AdaBoost 0.8351 0.7922 0.7504
5-Fold AdaBoost using GridSearchCV 0.9936 0.9061 0.8670
5-Fold AdaBoost using RandomizedSearchCV 0.9936 0.9061 0.8670
CPU times: user 1.68 s, sys: 55.6 ms, total: 1.74 s
Wall time: 2.82 s
In [82]:
%%time
# GradientBoostRegressor Hyperparameters Grid
print('Finding out the hyperparameters for GradientBoostRegressor with GridSearchCV'); print('--'*60)

param_grid = {'GradientBoost': {'max_depth': [5, 6, 7, 8, 9, 10, None], 'max_features': ['auto', 'sqrt'], 
                                'n_estimators': [600, 800, 1000]}}

# GradientBoostRegressor
gb_reg = GradientBoostingRegressor(random_state = random_state)
gb_reg_grid = GridSearchCV(gb_reg, param_grid['GradientBoost'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
gb_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['GradientBoost'] = gb_reg_grid.best_params_

print('Best parameters for Gradient Boost Regressor using GridSearchCV: {}'.format(best_params_grid['GradientBoost']))
Finding out the hyperparameters for GradientBoostRegressor with GridSearchCV
------------------------------------------------------------------------------------------------------------------------
Best parameters for Gradient Boost Regressor using GridSearchCV: {'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 600}
CPU times: user 972 ms, sys: 82.1 ms, total: 1.05 s
Wall time: 1min 59s
In [83]:
%%time
# GradientBoostRegressor Hyperparameters Grid
print('Finding out the hyperparameters for GradientBoostRegressor with RandomizedSearchCV'); print('--'*60)

param_grid = {'GradientBoost': {'max_depth': [5, 6, 7, 8, 9, 10, None], 'max_features': ['auto', 'sqrt'], 
                                'n_estimators': [600, 800, 1000]}}

# GradientBoostRegressor
gb_reg = GradientBoostingRegressor(random_state = random_state)
gb_reg_rand = RandomizedSearchCV(gb_reg, param_grid['GradientBoost'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
gb_reg_rand.fit(X_train_fe, y_train_fe)
best_params_random['GradientBoost'] = gb_reg_rand.best_params_

print('Best parameters for Gradient Boost Regressor using RandomizedSearchCV: {}'.format(best_params_random['GradientBoost']))
Finding out the hyperparameters for GradientBoostRegressor with RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
Best parameters for Gradient Boost Regressor using RandomizedSearchCV: {'n_estimators': 1000, 'max_features': 'sqrt', 'max_depth': 5}
CPU times: user 853 ms, sys: 40 ms, total: 893 ms
Wall time: 26.3 s
In [84]:
%%time
print('Using the 5-Fold Gradient Boost Regressor to train, validate and predict'); print('--'*60)
gb_reg = GradientBoostingRegressor(random_state = random_state)
df_reg = train_val_predict(gb_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold GradientBoost', model = 'GB')
df = df.append(df_reg)
Using the 5-Fold Gradient Boost Regressor to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
CPU times: user 554 ms, sys: 11.1 ms, total: 565 ms
Wall time: 830 ms
In [85]:
%%time
print('Using the 5-Fold Gradient Boost Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
gb_reg_grid = GradientBoostingRegressor(random_state = random_state, **best_params_grid['GradientBoost'])
df_reg_grid = train_val_predict(gb_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold GradientBoost using GridSearchCV', model = 'GB')
df = df.append(df_reg_grid)
Using the 5-Fold Gradient Boost Regressor to train, validate and predict using GridSearchCV
------------------------------------------------------------------------------------------------------------------------
CPU times: user 2.14 s, sys: 21.9 ms, total: 2.16 s
Wall time: 2.54 s
In [86]:
%%time
print('Using the 5-Fold Gradient Boost Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
gb_reg_rand = GradientBoostingRegressor(random_state = random_state, **best_params_random['GradientBoost'])
df_reg_rand = train_val_predict(gb_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold GradientBoost using RandomizedSearchCV', model = 'GB')
df = df.append(df_reg_rand)
display(df)
Using the 5-Fold Gradient Boost Regressor to train, validate and predict using RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
r2 Scores Train r2 Scores Val r2 Scores Test
5-Fold LinearRegression 0.7600 0.7504 0.6671
5-Fold LassoRegression 0.7587 0.7498 0.6653
5-Fold RidgeRegression 0.7600 0.7504 0.6671
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9856 0.8993 0.8722
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9862 0.9032 0.8776
5-Fold AdaBoost 0.8351 0.7922 0.7504
5-Fold AdaBoost using GridSearchCV 0.9936 0.9061 0.8670
5-Fold AdaBoost using RandomizedSearchCV 0.9936 0.9061 0.8670
5-Fold GradientBoost 0.9564 0.9019 0.8671
5-Fold GradientBoost using GridSearchCV 0.9990 0.9304 0.9012
5-Fold GradientBoost using RandomizedSearchCV 0.9991 0.9304 0.9011
CPU times: user 3.51 s, sys: 30.6 ms, total: 3.54 s
Wall time: 3.95 s
In [87]:
%%time
# ExtraTreesRegressor Hyperparameters Grid
print('Finding out the hyperparameters for ExtraTreesRegressor with GridSearchCV'); print('--'*60)

param_grid = {'ExtraTrees': {'max_depth': [5, 6, 7, 8, 9, 10, None], 'max_features': ['auto', 'sqrt'], 
                                'n_estimators': [100, 600, 800, 1000]}}

# ExtraTreesRegressor
et_reg = ExtraTreesRegressor(random_state = random_state)
et_reg_grid = GridSearchCV(et_reg, param_grid['ExtraTrees'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
et_reg_grid.fit(X_train_fe, y_train_fe)
best_params_grid['ExtraTrees'] = et_reg_grid.best_params_

print('Best parameters for Extra Trees Regressor using GridSearchCV: {}'.format(best_params_grid['ExtraTrees']))
Finding out the hyperparameters for ExtraTreesRegressor with GridSearchCV
------------------------------------------------------------------------------------------------------------------------
Best parameters for Extra Trees Regressor using GridSearchCV: {'max_depth': None, 'max_features': 'auto', 'n_estimators': 1000}
CPU times: user 2.76 s, sys: 184 ms, total: 2.95 s
Wall time: 2min 2s
In [88]:
%%time
# ExtraTreesRegressor Hyperparameters Grid
print('Finding out the hyperparameters for ExtraTreesRegressor with RandomizedSearchCV'); print('--'*60)

param_grid = {'ExtraTrees': {'max_depth': [5, 6, 7, 8, 9, 10, None], 'max_features': ['auto', 'sqrt'], 
                                'n_estimators': [100, 600, 800, 1000]}}

# ExtraTreesRegressor
et_reg = ExtraTreesRegressor(random_state = random_state)
et_reg_rand = RandomizedSearchCV(et_reg, param_grid['ExtraTrees'], cv = 5, n_jobs = -1, verbose = False, scoring = 'r2')
et_reg_rand.fit(X_train_fe, y_train_fe)
best_params_random['ExtraTrees'] = et_reg_rand.best_params_

print('Best parameters for Extra Trees Regressor using RandomizedSearchCV: {}'.format(best_params_random['ExtraTrees']))
Finding out the hyperparameters for ExtraTreesRegressor with RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
Best parameters for Extra Trees Regressor using RandomizedSearchCV: {'n_estimators': 800, 'max_features': 'auto', 'max_depth': None}
CPU times: user 1.91 s, sys: 118 ms, total: 2.03 s
Wall time: 25.5 s
In [89]:
%%time
print('Using the 5-Fold Extra Trees Regressor to train, validate and predict'); print('--'*60)
et_reg = ExtraTreesRegressor(random_state = random_state)
df_reg = train_val_predict(et_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold ExtraTrees', model = 'ET')
df = df.append(df_reg)
Using the 5-Fold Extra Trees Regressor to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
CPU times: user 1.27 s, sys: 60.5 ms, total: 1.33 s
Wall time: 1.87 s
In [90]:
%%time
print('Using the 5-Fold Extra Trees Regressor to train, validate and predict using GridSearchCV'); print('--'*60)
et_reg_grid = ExtraTreesRegressor(random_state = random_state, **best_params_grid['ExtraTrees'])
df_reg_grid = train_val_predict(et_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold ExtraTrees using GridSearchCV', model = 'ET')
df = df.append(df_reg_grid)
Using the 5-Fold Extra Trees Regressor to train, validate and predict using GridSearchCV
------------------------------------------------------------------------------------------------------------------------
CPU times: user 11 s, sys: 376 ms, total: 11.4 s
Wall time: 12.6 s
In [91]:
%%time
print('Using the 5-Fold Extra Trees Regressor to train, validate and predict using RandomizedSearchCV'); print('--'*60)
et_reg_rand = ExtraTreesRegressor(random_state = random_state, **best_params_random['ExtraTrees'])
df_reg_rand = train_val_predict(et_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold ExtraTrees using RandomizedSearchCV', model = 'ET')
df = df.append(df_reg_rand)
display(df)
Using the 5-Fold Extra Trees Regressor to train, validate and predict using RandomizedSearchCV
------------------------------------------------------------------------------------------------------------------------
r2 Scores Train r2 Scores Val r2 Scores Test
5-Fold LinearRegression 0.7600 0.7504 0.6671
5-Fold LassoRegression 0.7587 0.7498 0.6653
5-Fold RidgeRegression 0.7600 0.7504 0.6671
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9856 0.8993 0.8722
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9862 0.9032 0.8776
5-Fold AdaBoost 0.8351 0.7922 0.7504
5-Fold AdaBoost using GridSearchCV 0.9936 0.9061 0.8670
5-Fold AdaBoost using RandomizedSearchCV 0.9936 0.9061 0.8670
5-Fold GradientBoost 0.9564 0.9019 0.8671
5-Fold GradientBoost using GridSearchCV 0.9990 0.9304 0.9012
5-Fold GradientBoost using RandomizedSearchCV 0.9991 0.9304 0.9011
5-Fold ExtraTrees 0.9991 0.9106 0.8762
5-Fold ExtraTrees using GridSearchCV 0.9991 0.9132 0.8762
5-Fold ExtraTrees using RandomizedSearchCV 0.9991 0.9129 0.8766
CPU times: user 8.5 s, sys: 263 ms, total: 8.76 s
Wall time: 9.59 s

CatBoostRegressor

In [92]:
%%time
print('Finding out the hyperparameters for CatBoost with GridSearch'); print('--'*60)
param_grid = {'CatBoost': {'learning_rate': np.arange(0.01, 0.31, 0.05), 'depth': [3, 4, 5, 6, 7, 8, 9, 10], 'l2_leaf_reg': np.arange(2, 10, 1)}}

# Cat Boost Regressor
cat_reg = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5)
best_params = cat_reg.grid_search(param_grid['CatBoost'], X = X_train_fe, y = y_train_fe, cv = 3, verbose = 150)
best_params_grid['CatBoostGridSearch'] = best_params['params']
Finding out the hyperparameters for CatBoost with GridSearch
------------------------------------------------------------------------------------------------------------------------

bestTest = 0.3350859458
bestIteration = 998

0:	loss: 0.3350859	best: 0.3350859 (0)	total: 847ms	remaining: 5m 24s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.31878181
bestIteration = 294

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2884496855
bestIteration = 390

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2741704555
bestIteration = 228

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2817607447
bestIteration = 170

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3023768523
bestIteration = 71


bestTest = 0.3370487074
bestIteration = 998

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3172362664
bestIteration = 270

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3297042572
bestIteration = 126

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2987577624
bestIteration = 155

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2887605646
bestIteration = 133

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2861313275
bestIteration = 114


bestTest = 0.3379367736
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3320807063
bestIteration = 229

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3057035598
bestIteration = 244

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2857150111
bestIteration = 221

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3144095877
bestIteration = 74

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2788515522
bestIteration = 149


bestTest = 0.3387465797
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3308808147
bestIteration = 223

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3323089515
bestIteration = 124

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3121817885
bestIteration = 126

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2829151079
bestIteration = 192

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3066921597
bestIteration = 81


bestTest = 0.3412640436
bestIteration = 998

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3213621051
bestIteration = 306

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.32751334
bestIteration = 166

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3133960524
bestIteration = 138

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3010271547
bestIteration = 121

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2922587545
bestIteration = 108


bestTest = 0.3428912084
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.324886413
bestIteration = 271

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3281265774
bestIteration = 153

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3284852128
bestIteration = 87

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2820069036
bestIteration = 154

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2762608248
bestIteration = 155


bestTest = 0.3440088562
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3347172386
bestIteration = 243

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3218000621
bestIteration = 205

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3262725479
bestIteration = 92

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3308860752
bestIteration = 65

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3282980254
bestIteration = 52


bestTest = 0.3441401902
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3258673488
bestIteration = 292

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.334266911
bestIteration = 129

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.329427418
bestIteration = 85

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2840489613
bestIteration = 208

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3260655696
bestIteration = 44


bestTest = 0.3186822868
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2979027524
bestIteration = 282

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2941347407
bestIteration = 203

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2856701878
bestIteration = 112

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2812084623
bestIteration = 118

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.314364638
bestIteration = 45


bestTest = 0.3209426619
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2873808187
bestIteration = 343

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3229527885
bestIteration = 104

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2984375905
bestIteration = 83

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.267826339
bestIteration = 146

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2721383204
bestIteration = 141


bestTest = 0.3233423366
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2983242352
bestIteration = 329

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2894045859
bestIteration = 259

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2973696311
bestIteration = 114

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2961255587
bestIteration = 98

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2680971517
bestIteration = 123


bestTest = 0.325410013
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3087226167
bestIteration = 238

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3110915554
bestIteration = 111

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2969408827
bestIteration = 131

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3149910254
bestIteration = 53

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2807977129
bestIteration = 112


bestTest = 0.3264373313
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3049794953
bestIteration = 305

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3177166208
bestIteration = 111

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2822182496
bestIteration = 164

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2727383435
bestIteration = 166

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3134061365
bestIteration = 52


bestTest = 0.328653363
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2985313923
bestIteration = 369

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3270986301
bestIteration = 92

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3290218724
bestIteration = 49

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3003512093
bestIteration = 108

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2721685873
bestIteration = 162


bestTest = 0.3285035501
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3141020985
bestIteration = 230

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.307961732
bestIteration = 187

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2871569641
bestIteration = 179

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3059788359
bestIteration = 74

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3106172004
bestIteration = 65

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3305313323
bestIteration = 958

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3095190178
bestIteration = 282

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3046929582
bestIteration = 170

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3005778442
bestIteration = 126

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3179343326
bestIteration = 63

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3222300939
bestIteration = 42


bestTest = 0.3061440257
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.26913241
bestIteration = 293

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2907777078
bestIteration = 129

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2531720195
bestIteration = 179

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2862584082
bestIteration = 90

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2937311447
bestIteration = 69


bestTest = 0.3097831618
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2699652213
bestIteration = 381

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2706372087
bestIteration = 205

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2780435041
bestIteration = 90

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.284695982
bestIteration = 78

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2952732158
bestIteration = 54

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3136581371
bestIteration = 954

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3014173033
bestIteration = 195

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2692942905
bestIteration = 263

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2603467277
bestIteration = 168

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2941273295
bestIteration = 55

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2697441046
bestIteration = 93


bestTest = 0.3132790731
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2917850587
bestIteration = 287

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.272262252
bestIteration = 248

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2878276599
bestIteration = 73

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2571805482
bestIteration = 164

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2978435057
bestIteration = 41


bestTest = 0.3158049413
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2982146675
bestIteration = 237

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.298466229
bestIteration = 131

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2883111269
bestIteration = 118

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2735444866
bestIteration = 126

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2568687314
bestIteration = 119


bestTest = 0.3193438585
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3087332662
bestIteration = 232

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2663721051
bestIteration = 269

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2862105441
bestIteration = 107

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2918052746
bestIteration = 77

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2909968355
bestIteration = 58

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3223139488
bestIteration = 938

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2904335377
bestIteration = 312

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2782265825
bestIteration = 202

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.288952687
bestIteration = 117

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.299812029
bestIteration = 87

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2782154336
bestIteration = 120

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3263195259
bestIteration = 904

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3110831196
bestIteration = 196

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2874417698
bestIteration = 167

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3068457237
bestIteration = 63

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.302925754
bestIteration = 73

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2725067244
bestIteration = 142

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.293721208
bestIteration = 966

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2832005394
bestIteration = 194

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2994556004
bestIteration = 85

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.306760268
bestIteration = 61

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2850169833
bestIteration = 97

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2765955948
bestIteration = 82


bestTest = 0.3006905673
bestIteration = 999

150:	loss: 0.3006906	best: 0.2531720 (99)	total: 43.3s	remaining: 1m 6s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2731680944
bestIteration = 285

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2949762951
bestIteration = 103

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2820780178
bestIteration = 129

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2960272597
bestIteration = 67

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3020208611
bestIteration = 46


bestTest = 0.302465464
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2809261992
bestIteration = 297

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2966042562
bestIteration = 103

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2678685766
bestIteration = 282

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2573009671
bestIteration = 177

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.272060072
bestIteration = 118

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3082401646
bestIteration = 966

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3092362677
bestIteration = 153

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2828499541
bestIteration = 149

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2834137502
bestIteration = 175

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2930993937
bestIteration = 66

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2694459608
bestIteration = 105

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3117902691
bestIteration = 876

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3082018005
bestIteration = 180

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3092949632
bestIteration = 95

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2947564399
bestIteration = 145

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3055029497
bestIteration = 48

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2869649479
bestIteration = 89

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3167650612
bestIteration = 862

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3159361054
bestIteration = 183

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2948680996
bestIteration = 156

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3234430965
bestIteration = 63

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3046231409
bestIteration = 67

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3041245666
bestIteration = 55

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3181549907
bestIteration = 893

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3042789094
bestIteration = 244

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3124647182
bestIteration = 83

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.292998553
bestIteration = 119

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3013020976
bestIteration = 82

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2931365756
bestIteration = 81

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3192382712
bestIteration = 907

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3138227346
bestIteration = 152

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2876801062
bestIteration = 244

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3121894521
bestIteration = 104

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3121742992
bestIteration = 64

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3104909537
bestIteration = 42


bestTest = 0.2908646068
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2648812277
bestIteration = 314

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2609512825
bestIteration = 175

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2756536732
bestIteration = 153

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2842811112
bestIteration = 88

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3069320643
bestIteration = 43


bestTest = 0.2945966045
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2887456665
bestIteration = 180

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2739143857
bestIteration = 168

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2816524437
bestIteration = 128

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2835313934
bestIteration = 108

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2751563029
bestIteration = 91

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3027939738
bestIteration = 929

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2689986461
bestIteration = 407

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2749892035
bestIteration = 181

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3032333823
bestIteration = 67

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2637253108
bestIteration = 142

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2870598091
bestIteration = 62

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3097064063
bestIteration = 894

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2844207497
bestIteration = 299

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2886296048
bestIteration = 133

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2961379619
bestIteration = 84

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2647708865
bestIteration = 173

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2687966128
bestIteration = 122

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3119767408
bestIteration = 879

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3064646229
bestIteration = 153

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2856691705
bestIteration = 181

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2706573349
bestIteration = 236

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2826676654
bestIteration = 142

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2765086564
bestIteration = 130

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3150587583
bestIteration = 871

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2956582448
bestIteration = 263

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2780471711
bestIteration = 268

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2815955428
bestIteration = 181

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2941898529
bestIteration = 70

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3136493504
bestIteration = 51

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3134316822
bestIteration = 931

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.282007936
bestIteration = 308

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3061763436
bestIteration = 92

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2887581629
bestIteration = 176

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2843399331
bestIteration = 171

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3193890996
bestIteration = 61

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3145429197
bestIteration = 952

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3121269299
bestIteration = 167

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2949237471
bestIteration = 200

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3013947224
bestIteration = 85

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3028957346
bestIteration = 84

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3147615113
bestIteration = 68


bestTest = 0.2882110139
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2729221141
bestIteration = 285

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2892830149
bestIteration = 95

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2743019837
bestIteration = 129

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2819612056
bestIteration = 106

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2653434737
bestIteration = 102


bestTest = 0.2925688291
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2707857206
bestIteration = 348

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2779880146
bestIteration = 206

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2664048875
bestIteration = 168

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2867296615
bestIteration = 90

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3112488982
bestIteration = 59


bestTest = 0.300015245
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2621749485
bestIteration = 487

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2778431758
bestIteration = 224

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2767067626
bestIteration = 112

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3162225505
bestIteration = 51

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2834483024
bestIteration = 115

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3031985058
bestIteration = 934

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2712749857
bestIteration = 402

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2985641832
bestIteration = 145

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2783929357
bestIteration = 154

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2919372221
bestIteration = 81

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2746122042
bestIteration = 76

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3106289497
bestIteration = 898

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3078114393
bestIteration = 163

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3182241157
bestIteration = 78

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.287841231
bestIteration = 155

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2742135099
bestIteration = 110

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2897159543
bestIteration = 89

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3118034362
bestIteration = 913

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3186409489
bestIteration = 153

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2801261945
bestIteration = 257

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2728554014
bestIteration = 171

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3065472297
bestIteration = 54

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2875046296
bestIteration = 108


bestTest = 0.3101581587
bestIteration = 997

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3063420918
bestIteration = 158

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.314294937
bestIteration = 106

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3157888122
bestIteration = 99

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.279660282
bestIteration = 100

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2798581536
bestIteration = 112

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3121851444
bestIteration = 982

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3095717913
bestIteration = 168

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3213408523
bestIteration = 99

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2806213295
bestIteration = 240

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.282365299
bestIteration = 174

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3284776694
bestIteration = 42


bestTest = 0.2868831641
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2765581856
bestIteration = 326

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.305143505
bestIteration = 96

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2942549609
bestIteration = 93

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2753616767
bestIteration = 104

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3067746628
bestIteration = 58


bestTest = 0.2957568923
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2680169718
bestIteration = 332

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2794742854
bestIteration = 206

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2865742759
bestIteration = 124

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2811919047
bestIteration = 146

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3096440097
bestIteration = 69


bestTest = 0.3037236697
bestIteration = 999

300:	loss: 0.3037237	best: 0.2531720 (99)	total: 2m 52s	remaining: 47.6s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2845925082
bestIteration = 323

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2853841254
bestIteration = 160

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2910297779
bestIteration = 116

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2963023434
bestIteration = 68

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3000127291
bestIteration = 53


bestTest = 0.3051679766
bestIteration = 995

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2808337101
bestIteration = 398

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2962753695
bestIteration = 143

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2800651037
bestIteration = 199

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.289306991
bestIteration = 146

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2896051501
bestIteration = 111


bestTest = 0.3064963688
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2818136892
bestIteration = 368

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2716561545
bestIteration = 341

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2970063089
bestIteration = 167

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2740107189
bestIteration = 180

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3028371058
bestIteration = 60


bestTest = 0.3060864628
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2757900289
bestIteration = 479

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2979852935
bestIteration = 200

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2791468549
bestIteration = 190

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3086101843
bestIteration = 96

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3189516453
bestIteration = 45


bestTest = 0.3093533882
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3113872963
bestIteration = 222

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3161942263
bestIteration = 120

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2830590514
bestIteration = 215

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2765301469
bestIteration = 253

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2826165923
bestIteration = 149


bestTest = 0.3103769582
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2972885636
bestIteration = 291

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.322809005
bestIteration = 116

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3193673662
bestIteration = 66

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3239599788
bestIteration = 55

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.28898599
bestIteration = 111


bestTest = 0.2950983502
bestIteration = 998

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2800504374
bestIteration = 381

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2865669773
bestIteration = 138

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3025715989
bestIteration = 72

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3099843398
bestIteration = 84

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.314005517
bestIteration = 87


bestTest = 0.2998264497
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2780635892
bestIteration = 354

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2817605605
bestIteration = 160

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2816269751
bestIteration = 196

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.308932414
bestIteration = 130

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.288557951
bestIteration = 106


bestTest = 0.303488405
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2734515031
bestIteration = 336

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2945854481
bestIteration = 122

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2739362509
bestIteration = 197

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3001181139
bestIteration = 215

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3052478892
bestIteration = 127


bestTest = 0.3085376804
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2946131883
bestIteration = 179

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2775640942
bestIteration = 282

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3130076997
bestIteration = 83

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3050763828
bestIteration = 86

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3043982081
bestIteration = 111


bestTest = 0.3077563051
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2775649483
bestIteration = 446

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2885659497
bestIteration = 234

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2864185762
bestIteration = 131

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2884038306
bestIteration = 117

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3086118374
bestIteration = 111


bestTest = 0.3106097391
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2914126211
bestIteration = 214

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2768074026
bestIteration = 333

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2964159861
bestIteration = 183

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2914171188
bestIteration = 218

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3097782312
bestIteration = 128


bestTest = 0.315587337
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2834070777
bestIteration = 400

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2781721058
bestIteration = 256

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2843894024
bestIteration = 216

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2905658828
bestIteration = 167

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3111493174
bestIteration = 87


bestTest = 0.3141762212
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.300825177
bestIteration = 276

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3045328049
bestIteration = 135

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2810724809
bestIteration = 271

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2868017953
bestIteration = 156

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3244804536
bestIteration = 46

383:	loss: 0.3244805	best: 0.2531720 (99)	total: 10m 19s	remaining: 0us
Estimating final quality...
Stopped by overfitting detector  (5 iterations wait)
CPU times: user 19min 9s, sys: 6min 46s, total: 25min 55s
Wall time: 10min 23s
In [93]:
%%time
print('Finding out the hyperparameters for CatBoost with RandomSearch'); print('--'*60)
param_grid = {'CatBoost': {'learning_rate': np.arange(0.01, 0.31, 0.05), 'depth': [3, 4, 5, 6, 7, 8, 9, 10], 'l2_leaf_reg': np.arange(2, 10, 1)}}

# Cat Boost Regressor
cat_reg = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5)
best_params = cat_reg.randomized_search(param_grid['CatBoost'], X = X_train_fe, y = y_train_fe, cv = 3, verbose = 150)
best_params_grid['CatBoostRandomSearch'] = best_params['params']
Finding out the hyperparameters for CatBoost with RandomSearch
------------------------------------------------------------------------------------------------------------------------
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2887605646
bestIteration = 133

0:	loss: 0.2887606	best: 0.2887606 (0)	total: 269ms	remaining: 2.42s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3014173033
bestIteration = 195

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2947564399
bestIteration = 145

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3167650612
bestIteration = 862

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.282007936
bestIteration = 308

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.279660282
bestIteration = 100


bestTest = 0.3064963688
bestIteration = 999

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3028371058
bestIteration = 60

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3043982081
bestIteration = 111

Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3086118374
bestIteration = 111

9:	loss: 0.3086118	best: 0.2796603 (5)	total: 16.5s	remaining: 0us
Estimating final quality...
Stopped by overfitting detector  (5 iterations wait)
CPU times: user 48.4 s, sys: 8.79 s, total: 57.2 s
Wall time: 24.6 s
In [94]:
%%time
print('Using the 5-Fold CatBoost Regressor to train, validate and predict'); print('--'*60)
cb_reg = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5)
df_reg = train_val_predict(cb_reg, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold CatBoost', model = 'CatBoost')
df = df.append(df_reg)
Using the 5-Fold CatBoost Regressor to train, validate and predict
------------------------------------------------------------------------------------------------------------------------
Learning rate set to 0.042561
0:	learn: 0.9709968	test: 1.0205756	best: 1.0205756 (0)	total: 16.7ms	remaining: 16.6s
300:	learn: 0.1848948	test: 0.2499218	best: 0.2499218 (300)	total: 581ms	remaining: 1.35s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2359332958
bestIteration = 413

Shrink model to first 414 iterations.
Learning rate set to 0.042561
0:	learn: 0.9787593	test: 1.0007861	best: 1.0007861 (0)	total: 1.45ms	remaining: 1.45s
300:	learn: 0.1739228	test: 0.2722814	best: 0.2722814 (300)	total: 500ms	remaining: 1.16s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2528417841
bestIteration = 564

Shrink model to first 565 iterations.
Learning rate set to 0.042573
0:	learn: 0.9838777	test: 0.9783261	best: 0.9783261 (0)	total: 1.18ms	remaining: 1.18s
300:	learn: 0.1692673	test: 0.3267320	best: 0.3267320 (300)	total: 444ms	remaining: 1.03s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3184696151
bestIteration = 361

Shrink model to first 362 iterations.
Learning rate set to 0.042573
0:	learn: 0.9754995	test: 0.9981275	best: 0.9981275 (0)	total: 1.43ms	remaining: 1.43s
300:	learn: 0.1692974	test: 0.3098727	best: 0.3098727 (300)	total: 367ms	remaining: 851ms
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2992286417
bestIteration = 422

Shrink model to first 423 iterations.
Learning rate set to 0.042573
0:	learn: 0.9966460	test: 0.9240496	best: 0.9240496 (0)	total: 1.54ms	remaining: 1.53s
300:	learn: 0.1773478	test: 0.2198035	best: 0.2198035 (300)	total: 411ms	remaining: 954ms
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2119459944
bestIteration = 378

Shrink model to first 379 iterations.
CPU times: user 7.7 s, sys: 889 ms, total: 8.58 s
Wall time: 4.31 s
In [95]:
%%time
print('Using the 5-Fold CatBoost Regressor to train, validate and predict using GridSearch'); print('--'*60)
cb_reg_grid = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5, **best_params_grid['CatBoostGridSearch'])
df_reg_grid = train_val_predict(cb_reg_grid, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold CatBoost GridSearch', model = 'CatBoost')
df = df.append(df_reg_grid)
Using the 5-Fold CatBoost Regressor to train, validate and predict using GridSearch
------------------------------------------------------------------------------------------------------------------------
0:	learn: 0.9000772	test: 0.9438384	best: 0.9438384 (0)	total: 1.91ms	remaining: 1.91s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2201061312
bestIteration = 145

Shrink model to first 146 iterations.
0:	learn: 0.9044620	test: 0.9313982	best: 0.9313982 (0)	total: 1.19ms	remaining: 1.19s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2430790481
bestIteration = 137

Shrink model to first 138 iterations.
0:	learn: 0.9077839	test: 0.9157462	best: 0.9157462 (0)	total: 1.26ms	remaining: 1.26s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3212914632
bestIteration = 117

Shrink model to first 118 iterations.
0:	learn: 0.9020886	test: 0.9226422	best: 0.9226422 (0)	total: 980us	remaining: 980ms
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3140545564
bestIteration = 88

Shrink model to first 89 iterations.
0:	learn: 0.9218339	test: 0.8503385	best: 0.8503385 (0)	total: 834us	remaining: 834ms
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.206473243
bestIteration = 155

Shrink model to first 156 iterations.
CPU times: user 1.74 s, sys: 266 ms, total: 2.01 s
Wall time: 1.59 s
In [96]:
%%time
print('Using the 5-Fold CatBoost Regressor to train, validate and predict using RandomSearch'); print('--'*60)
cb_reg_rand = CatBoostRegressor(iterations = None, random_state = random_state, od_type = 'Iter', od_wait = 5, **best_params_grid['CatBoostRandomSearch'], verbose = False)
df_reg_rand = train_val_predict(cb_reg_rand, X_train_fe, y_train_fe, X_test_fe, y_test_fe, '5-Fold CatBoost RandomSearch', model = 'CatBoost')
df = df.append(df_reg_rand)
display(df)
Using the 5-Fold CatBoost Regressor to train, validate and predict using RandomSearch
------------------------------------------------------------------------------------------------------------------------
0:	learn: 0.8974974	test: 0.9370782	best: 0.9370782 (0)	total: 156ms	remaining: 2m 35s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2131646586
bestIteration = 152

Shrink model to first 153 iterations.
0:	learn: 0.8977644	test: 0.9356185	best: 0.9356185 (0)	total: 3.92ms	remaining: 3.92s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2743875529
bestIteration = 168

Shrink model to first 169 iterations.
0:	learn: 0.9067240	test: 0.9232859	best: 0.9232859 (0)	total: 3.23ms	remaining: 3.23s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.316746161
bestIteration = 136

Shrink model to first 137 iterations.
0:	learn: 0.8951573	test: 0.9250468	best: 0.9250468 (0)	total: 6.4ms	remaining: 6.39s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.3167610684
bestIteration = 110

Shrink model to first 111 iterations.
0:	learn: 0.9172922	test: 0.8452851	best: 0.8452851 (0)	total: 2.94ms	remaining: 2.94s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.2106142772
bestIteration = 101

Shrink model to first 102 iterations.
r2 Scores Train r2 Scores Val r2 Scores Test
5-Fold LinearRegression 0.7600 0.7504 0.6671
5-Fold LassoRegression 0.7587 0.7498 0.6653
5-Fold RidgeRegression 0.7600 0.7504 0.6671
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9856 0.8993 0.8722
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9862 0.9032 0.8776
5-Fold AdaBoost 0.8351 0.7922 0.7504
5-Fold AdaBoost using GridSearchCV 0.9936 0.9061 0.8670
5-Fold AdaBoost using RandomizedSearchCV 0.9936 0.9061 0.8670
5-Fold GradientBoost 0.9564 0.9019 0.8671
5-Fold GradientBoost using GridSearchCV 0.9990 0.9304 0.9012
5-Fold GradientBoost using RandomizedSearchCV 0.9991 0.9304 0.9011
5-Fold ExtraTrees 0.9991 0.9106 0.8762
5-Fold ExtraTrees using GridSearchCV 0.9991 0.9132 0.8762
5-Fold ExtraTrees using RandomizedSearchCV 0.9991 0.9129 0.8766
5-Fold CatBoost 0.9792 0.9296 0.8934
5-Fold CatBoost GridSearch 0.9774 0.9303 0.8871
5-Fold CatBoost RandomSearch 0.9861 0.9276 0.8872
CPU times: user 6 s, sys: 392 ms, total: 6.39 s
Wall time: 3.15 s

Bootstrapping Confidence Level

In [97]:
%%time
values = concrete_im.values
n_iterations = 500 # Number of bootstrap samples to create
n_size = int(len(concrete_im) * 1) # size of a bootstrap sample

# run bootstrap
stats = list() # empty list that will hold the scores for each bootstrap iteration
for i in range(n_iterations):
  # prepare train and test sets
  train = resample(values, n_samples = n_size) # Sampling with replacement 
  test = np.array([x for x in values if x.tolist() not in train.tolist()]) # picking rest of the data not considered in sample
  
  # fit model
  gb_reg_grid = GradientBoostingRegressor(random_state = random_state, **best_params_grid['GradientBoost'])
  gb_reg_grid.fit(train[:, :-1], train[:, -1]) # fit against independent variables and corresponding target values

  # evaluate model
  predictions = gb_reg_grid.predict(test[:, :-1]) # predict based on independent variables in the test data
  score = r2_score(test[:, -1], predictions)
  stats.append(score)
CPU times: user 8min 22s, sys: 3.93 s, total: 8min 26s
Wall time: 9min 6s
In [98]:
# plot scores
plt.figure(figsize = (15, 7.2))
plt.hist(stats); plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence 
p = ((1.0 - alpha) / 2.0) * 100 # tail regions on right and left .25 on each side indicated by P value (border)
lower = max(0.0, np.percentile(stats, p))  

p = (alpha + ((1.0 - alpha) / 2.0)) * 100
upper = min(1.0, np.percentile(stats, p))

print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
95.0 confidence interval 76.8% and 96.2%

Conclusion

  • In the EDA stage we explored the use of custom function to view IQR, Kurtosis, Skewness and other statistics and comment on whether the column has outliers or not. Performed further descriptive statistics to explore missing values and outliers. We also checked the relation between independent and dependent variables and observed that for some variables the relationship was linear. Used strategies such as visuals (boxplots), IQR method, studentized residual, leverage, Cook's D and DFFITS to analyze outliers and address leverage points.
  • In the feature engineering stage, we identified opportunities to add features based on variable interaction and gaussians but ended up adding multicollinearity in the data.
  • Used methods such as model based feature importance, eli5, correlation matrix, absolute correlation and variance inflation factor to understand important attributes.
  • Used cross validation method to compare linear and non-linear/tree-based models on training and validation sets. Here it was also important to check and see if there's any improvement after feature engineering. We found a significant improvement in the R2 scores (which was chosen as an evaluation criterion for the study) after performing exploratory data analysis and feature engineering steps.
  • It was also important to decide on whether to scale the data or not and if it's to be scaled then which method to be used. For this we ran a comparison analysis on both validation and test sets and saw that the results were almost similar.
  • Here we tried **3 linear regressions (Linear, Lasso and Ridge) and decision tree-based regression methods such as Decision Tree, Random Forest, AdaBoost, Gradient Boost and Extra Trees regressor**. We used **k-Fold cross validation, grid search and random search methods** to squeeze out the extra performance from the regressors. For some it resulted an improvement while for some it didn't owing to the limited hyperparameter space. For this specific problem, **Gradient Boost Regressor** turned out to be the best performing model when used with 5-fold cross validation and grid/random search with r2 for **training, validation and test as 0.999, 0.931 and 0.899 respectively**. Additionally, we also explored CatBoost regressor to explore interaction between features and modelling as well. None of the model were seen overfitting.
  • Then we also Bootstrapping method to calculate confidence intervals for Gradient Boost Regressor. We found a 95% likelihood of r2 score between 76.5% and 96.1% for Gradient Boost Regressor.
In [99]:
display(df)
r2 Scores Train r2 Scores Val r2 Scores Test
5-Fold LinearRegression 0.7600 0.7504 0.6671
5-Fold LassoRegression 0.7587 0.7498 0.6653
5-Fold RidgeRegression 0.7600 0.7504 0.6671
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9856 0.8993 0.8722
5-Fold DecisionTree 0.9991 0.8394 0.7951
5-Fold DecisionTree GridSearchCV 0.9705 0.8294 0.7884
5-Fold DecisionTree RandomizedSearchCV 0.9705 0.8294 0.7884
5-Fold RandomForest 0.9861 0.9010 0.8777
5-Fold RandomForest GridSearchCV 0.9991 0.9092 0.8809
5-Fold RandomForest RandomizedSearchCV 0.9862 0.9032 0.8776
5-Fold AdaBoost 0.8351 0.7922 0.7504
5-Fold AdaBoost using GridSearchCV 0.9936 0.9061 0.8670
5-Fold AdaBoost using RandomizedSearchCV 0.9936 0.9061 0.8670
5-Fold GradientBoost 0.9564 0.9019 0.8671
5-Fold GradientBoost using GridSearchCV 0.9990 0.9304 0.9012
5-Fold GradientBoost using RandomizedSearchCV 0.9991 0.9304 0.9011
5-Fold ExtraTrees 0.9991 0.9106 0.8762
5-Fold ExtraTrees using GridSearchCV 0.9991 0.9132 0.8762
5-Fold ExtraTrees using RandomizedSearchCV 0.9991 0.9129 0.8766
5-Fold CatBoost 0.9792 0.9296 0.8934
5-Fold CatBoost GridSearch 0.9774 0.9303 0.8871
5-Fold CatBoost RandomSearch 0.9861 0.9276 0.8872
In [ ]: